Deliverables Report

IST-2001-33310 VICTEC

Operational Evaluation Methodology

AUTHORS: Carsten Zoll, Sibylle Enz, Harald Schaub, Sarah Woods, Nisa Silva, Lynne Hall, Polly Sobreperez, Sandy Louchart, Isabel Machado

STATUS: Draft

CHECKERS: Sue Richardson, Ruth Aylett

- 1 - Deliverable7.1.1/final Deliverables Report

PROJECT MANAGER

Name: Ruth Aylett
Address: CVE, Business House, University of Salford, University Road, Salford, M5 4WT
Phone Number: +44 161 295 2922
Fax Number: +44 161 295 2925
E-mail: [email protected]

TABLE OF CONTENTS

0 PURPOSE OF THE DOCUMENT
1 EXECUTIVE OVERVIEW
2 INTRODUCTION
3 EVALUATION OF THE TOOLKIT
3.1 Objectives and Design
3.2 Description of the Diagnostic Instruments of the Toolkit Evaluation
4 EVALUATION OF THE DEMONSTRATOR
4.1 Technological Evaluation
4.1.1 Objectives and Design
4.1.2 Description of the Diagnostic Instruments of the Technological Evaluation
4.2 Usability Evaluation
4.2.1 Objectives and Design
4.2.2 Description of the Diagnostic Instruments of the Usability Evaluation
4.3 Pedagogical Evaluation
4.3.1 Objectives and Design
4.3.2 Description of the Diagnostic Instruments of the Pedagogical Evaluation
4.4 Psychological Evaluation
4.4.1 Objectives and Design
4.4.2 Description of the Diagnostic Instruments of the Psychological Evaluation
5 CONCLUSIONS
REFERENCES
APPENDIX
A - Technical Evaluation of the Demonstrator (Tables)
B - Pedagogical Evaluation of the Demonstrator (Tables)
C - Psychological Evaluation of the Demonstrator (Tables)
D - Evaluation tools CD

0 Purpose of Document

This document provides detailed information on the planned evaluation of the two pieces of software produced by the VICTEC project team: the Toolkit (a generic software program for constructing virtual 3D environments populated with autonomous, empathic virtual agents) and the Demonstrator (a software program that implements the autonomous, empathic agents in a virtual school environment for the purpose of anti-bullying education). Both pieces of software are to be evaluated as already proposed in Deliverable 2.2.2. This document specifies the proposed evaluation tools and provides detailed information on the statistical analyses and on the interpretation of the data thus collected.

1 Executive Overview

This document specifies the evaluation efforts already described in Deliverable 2.2.2. The planned evaluation will take place both during and at the end of the project. Two peer-reviewed documents will report the results of the evaluation at a later stage of the project (D7.2.1: Evaluation of Demonstrator in schools, to be delivered in month 28, and D7.3.1: Evaluation of Toolkit, to be delivered in month 30). Deliverable 7.1.1 provides detailed information on the planned evaluation of the software generated by the VICTEC project. Key concepts within the evaluation framework are agent believability, credibility and empathy, which are to be operationalized to produce measurable criteria. The description of every tool is therefore divided into parts covering a detailed description of the particular measurement tool, information on its quality factors (wherever feasible) and on its application within the educational context, and the procedures for analyzing and interpreting the extracted data. The document has two main sections: the evaluation of the Toolkit, a generic software program that enables users to create empathic and autonomous agents in virtual environments, and the evaluation of the Demonstrator, a software program for children aged eight to twelve years, which is situated in the PSE context and intended to provide anti-bullying strategies.
The evaluation of the Toolkit focuses on effectiveness, efficiency and user satisfaction (ISO 9241-11). Discount Usability Engineering techniques are therefore being used as part of the Toolkit development process, which is based on an iterative prototyping approach:

Evaluation of Toolkit
Goals
- Evaluation of usability based on ISO 9241-11: effectiveness, efficiency, satisfaction
Methods
- Low fidelity prototyping: paper model of the Toolkit
- High fidelity prototyping: expert walkthroughs
- User testing

The evaluation of the Demonstrator divides into four different aspects (technological aspects, usability, psychological effects and pedagogical effects).

The Technological Evaluation covers the stability and hardware and software compatibility of the Demonstrator. Furthermore, it focuses on the minimum technical requirements (hardware and software) that are necessary to ensure optimal performance of the product:

Technological Evaluation of Demonstrator
Goals
- Stability
- Compatibility
- Minimum hardware requirements
Methods
- Internal tests
- Experts’ evaluation
- Potential end users’ evaluation

Also based on the ISO 9241-11 standard, the Usability Evaluation of the Demonstrator covers the quality of the interaction system, the ease of use of interface objects, the satisfaction and engagement of the user, and the physical embodiment of the environment and the agents. Currently, the Usability Evaluation is carried out in parallel with the development of the Demonstrator, using the VICTEC trailer (a reduced, less interactive version of the Demonstrator) as a substitute until a Demonstrator exists. Using the trailer approach, the Usability Evaluation focuses on the believability of agents, environments and storyline, the agents’ likability, and ease of use:

Usability Evaluation of Demonstrator
Goals
- Measurement of ISO 9241-11 criteria: effectiveness, efficiency, and satisfaction
- Physical embodiment of the virtual environment and agents
Methods
1) Iterative usability evaluation during the development of the Demonstrator
2) Large scale testing with completed Demonstrator

The Pedagogical Evaluation of the Demonstrator uses a pre-/post-test design with four diagnostic instruments. Its main goals are to investigate the effects of user interaction with the Demonstrator and to involve teachers (in order to assess and create acceptance for the application of the Demonstrator in schools):

Pedagogical Evaluation of Demonstrator
Goals
- Effect of user interaction with Demonstrator on cognitive and behavioural bullying aspects
- Involvement of teachers
Methods
- Teacher rating of bullying
- Bullying questionnaire
- Empathy questionnaire
- Picture-Story (Pupil test)

The Psychological Evaluation concentrates on the investigation of interactions between the work with the Demonstrator and bullying. Other research topics focus on the psychological processes underlying coping strategies and on differences between the bullying profiles in “Theory of Mind” abilities and emotion recognition abilities:

Psychological Evaluation of Demonstrator
Goals
- Effect of user interaction with Demonstrator on cognitive and behavioural bullying aspects
- Involvement of teachers to assess and create acceptance for the application of the Demonstrator in schools
Methods
- Bullying scenarios
- Bullying questionnaire
- Theory of Mind questions
- DANVA (assessment of emotion recognition abilities)

2 Introduction

Evaluation within the VICTEC project is not a one-time event but a continuous process. It covers a variety of procedures aimed at evaluating user interaction with the software, social immersion and empathic processes between the user and the synthetic characters, and the effect of the user interaction on the social behaviour of the user. Therefore, potential user groups and experts, children and teachers, are to be considered. This document provides information on the efforts that have been made so far to optimize the final evaluations of the VICTEC software.

The VICTEC project is to deliver an Authoring Toolkit, which will be used at a later stage of the project to construct a specific Demonstrator in the context of anti-bullying education. As the development of the Toolkit is considerably further advanced than that of the Demonstrator, some results on the former are already available. Nevertheless, as the evaluation is an ongoing process, further evaluation results will be reported in the forthcoming separate deliverables on the evaluation of the Toolkit and the Demonstrator.

The section on the evaluation of the Demonstrator concentrates on the activities carried out so far to assess the quality of the measurements planned for the end of the project in 2004: following internal tests, the Technological Evaluation involves experts and potential user groups; the Usability Evaluation seeks to determine whether users are able to interact effectively, efficiently and enjoyably with the Demonstrator; the Pedagogical Evaluation, carried out with complete school classes in the participating countries, mainly aims to measure the cognitive and affective effects of the child users’ interaction with the Demonstrator, with a focus on empathic processes; and the Psychological Evaluation covers the relations between bullying, coping strategies and mental representations (e.g. Theory of Mind).

3 Evaluation of the Toolkit

3.1 Objectives and Design

Our aim with the evaluation of the Toolkit was to determine whether it was usable, as defined in ISO 9241-11 (International Standards Organisation, 1998), focusing on:
- Effectiveness
  - Could the Toolkit users complete tasks using the system?
  - Was the quality of the output of those tasks acceptable?
  - Could the user easily achieve the goals that the system was intended to support?
- Efficiency
  - Was the level of resource (mainly human effort) appropriate for the tasks performed and their outcomes?
  - Could user goals be achieved within acceptable levels of resource, whether this is mental energy, physical effort or time?
- Satisfaction
  - Did the users have positive subjective reactions to using the Toolkit?
  - Was the user experience positive, enjoyable and satisfying in accomplishing their goals?

An iterative prototyping approach is being used for the Toolkit development, with usability evaluations occurring as part of that iteration. Each evaluation aims to identify elements that need modification, deletion or addition to enhance the usability of the Toolkit. The evaluations were performed using Discount Usability Engineering techniques (Nielsen, 1994b), working with different levels of Toolkit fidelity (from paper-based to high fidelity prototypes). The main instruments used within the current evaluations have been usability inspections, quality assurance and usability testing. Although user satisfaction has been briefly considered, no rating of it has yet taken place; this will be considered in September 2003.

3.2 Description of the Diagnostic Instruments of the Toolkit Evaluation

3.2.1 Low Fidelity Prototyping

Description
The low fidelity prototyping involved the creation of a paper model of the Toolkit, following the approach successfully used in many domains by usability practitioners (Rudd, Stern, & Isensee, 1996; Snyder, 1996). The results of the paper prototyping exercise were presented to the Toolkit developers in the form of digital photos (representing the various states of the prototype), screen shots and recommendations.

Application
The paper prototyping activity took place in 4 half-day sessions over a two-week period, with the expert working through the various tasks of the Toolkit and making recommendations for changing the Toolkit. The paper prototyping occurred simultaneously with the initial expert walkthrough with the high fidelity prototype (see below).

Evaluation Factors
The factors under consideration using the low fidelity prototype focused on the efficiency and effectiveness of the Toolkit, offering alternative designs where usability problems were identified. The factors focused on during this evaluation were:
- Could the user easily achieve the goals that the system was intended to support?
- Could user goals be achieved within acceptable levels of resource?

Analysis and Interpretation
The results of the paper prototype highlighted a number of potential changes to the Toolkit that would reduce user effort and increase speed and effectiveness. A number of problems were identified relating to task structures that were difficult to follow, and alternative structures were proposed.

3.2.2 Usability Inspection of the High Fidelity Prototype

Description
The usability inspection of the high fidelity prototype was performed through expert walkthroughs. These involved exploratory sessions with Toolkit Versions 1 and 1.1 by experienced discount usability practitioners (Nielsen, 1992; Nielsen, 1993; Nielsen, 1994a). In the sessions, the experts carried out a quality assurance of the Toolkit by performing a number of tasks and rating the effectiveness and efficiency of the prototype against those tasks. The heuristics used for this evaluation were based on those of authors such as Nielsen (Nielsen, 2002) and on a competitive analysis of Kar2ouche (Immersive Education, 2001).

Application
Initial Walkthrough: The tasks and activities used for the quality assurance of the initial walkthrough were derived from an analysis of the various documents identifying the requirements of the Toolkit. The main goals for the high fidelity walkthrough were based on the requirements specification provided by the Toolkit developers. The high fidelity prototype (Toolkit Version 1) was provided to an expert who is a member of the VICTEC team. This expert has domain knowledge and an understanding of the intended aims of the Demonstrator and the Toolkit, and is a potential user of the Toolkit. The expert walkthrough was conducted over 4 time periods of approximately 2 hours each. The expert used heuristics to evaluate the Toolkit for the creation of virtual environments populated with intelligent virtual agents. Within these periods, the expert identified gross / obvious usability factors, explored issues which appeared to have potential usability problems (i.e. failed to support the tasks and activities of the Toolkit) and noted structures and features which needed further investigation.

Version 1.1 Walkthrough: The walkthrough with Version 1.1 of the Toolkit (created after the initial recommendations from the expert walkthrough and the paper prototype) was carried out by 2 expert evaluators using a goal-following method. The walkthrough was conducted during a half-day session, with each expert working independently on their evaluation. The usability debrief and aggregation of usability problems focused on the prioritisation of problems, identifying those that were felt to be most significant. A number of problems and issues were identified, and quality scripts were created to test these identified problems with users.

Evaluation Factors
The evaluation factors under consideration in the expert walkthrough with the high fidelity prototype focused on the effectiveness of the Toolkit, that is, on whether the Toolkit could be used to partially complete a series of tasks using the system. As the early versions of the Toolkit could not generate output and the graphics and other visual elements were not of final quality, factors related to task closure in terms of display / output were ignored. The factors that were focused on are:
- Could the user easily achieve the goals that the system was intended to support?
- Could the Toolkit users complete tasks using the system?

Analysis and Interpretation
The expert walkthroughs were performed with a high fidelity prototype created at Salford. This prototype reflects the main requirements of the Salford developers (one of the intended user groups), with a focus on back-story and environment. Although it was possible to create the graphical appearance of the agent (with greeked images), tasks relating to character configuration need further consideration.

The initial evaluation identified a number of usability issues and problems that required further consideration (achieved through low fidelity prototyping, as already detailed above). The results of the evaluation were presented in the form of recommendations to the Toolkit developers. The second evaluation identified a number of repeat problems that had been present in Toolkit Version 1, although most of the problems, particularly those related to gross or highly visible usability issues (such as non-functioning elements and spelling mistakes), had been resolved. Recommendations were made in relation to these usability issues. The outstanding problems related to task structures identified during the Toolkit Version 1.1 evaluation were prioritised as factors that must be further considered through quality script generation and user testing (see below). Comparing Toolkit Version 1.1 with products such as Kar2ouche showed that the overall appearance of the Toolkit needed further work. Additional requirements for the Toolkit also need to be identified through further user requirements analysis.

3.2.3 User Testing

Description
User testing was used to determine whether the problems identified by the expert evaluation could be confirmed or rejected. Each user worked through a series of quality scripts (lists of actions that the users had to perform) and discussed any issues arising from their interaction, using a think-aloud protocol. Users were also asked about their satisfaction with using the Toolkit.

Application
Nine users participated in user testing with Version 1.1 of the Toolkit. The user tests involved quality scripts that were developed to cover tasks and structures identified as potentially problematic by the usability inspection. Usability criteria generated by the expert evaluators provided a scoring mechanism for the evaluation of the Toolkit. At this stage of the development process, the scoring is performed by the researcher observing the user interaction with the Toolkit. Each of the user tests took approximately 30 minutes, with participants being asked to discuss the Toolkit where relevant. Users were also asked for their opinions on the overall look and feel of the Toolkit.

Evaluation Factors
In addition to effectiveness and efficiency, the user testing of Version 1.1 of the Toolkit also considered user satisfaction. The user testing was goal focused, seeking to identify whether:
- The user could easily achieve the goals that the system was intended to support.
- The amount of human effort was appropriate for the tasks performed.
- The user experience was positive and satisfying in accomplishing their goals.

Analysis and Interpretation
All of the task structures identified as problematic by the expert evaluation were found to be problematic for the majority of the users. A number of usability problems additional to those specified by the expert evaluators were also identified. The various problems and potential solutions are to be provided as recommendations to the developers.

4 Evaluation of Demonstrator

4.1 Technological Evaluation

4.1.1 Objectives and Design

Evaluation is an important phase of product development: it supports the whole design process and informs the development team about how well the proposed design fits the needs of the users in terms of their characteristics, the activities for which the system will be used, the environment of use and the technology that supports it. Evaluation also aims to provide ways of answering questions about the developers’ and users’ expectations of the product. As the project has a substantial technological component, technical evaluation is essential to ensure that the system works correctly, so that technology does not compromise the aim of the product/study. Different kinds of technological evaluation may be carried out at different stages of design, and evaluation is treated as central to the whole production process (see Figure 1).

[Figure: star diagram with Evaluation at the centre, connected to Task analysis/Functional analysis, Implementation, Prototyping, Requirements specification, and Conceptual design/Formal design]

Figure 1: The Star Life Cycle (adapted from Preece, Jenny, Human-Computer Interaction, 1994, Addison-Wesley, page 596)

Reasons to evaluate:
- To identify and eliminate outstanding problems with the system before going live;
- To avoid difficulties during implementation;
- To identify user difficulties so that the product can be more finely tuned to meet their needs;
- To ensure that the system is easy to use;
- To upgrade the product.

For the schedule of the technical evaluation, see Table 1 in Appendix A.

4.1.2 Description of the Diagnostic Instruments of the Technological Evaluation

4.1.2.1 Internal Tests

Description
Internal software and interface tests are designed to validate the efficiency of the system in different environments, namely with different operating systems, browsers, hardware configurations, display resolutions and types of Internet access.

Application
Internal tests are carried out before the evaluation by experts and end users. These tests are performed in the laboratory by the software developers and all the partners involved in the project, in order to detect bugs, inconsistencies with the plan and interface problems in both the Internet and CD-ROM versions. These tests are expected to last 2 weeks.

Evaluation Factors
Internet Evaluation
- Programming tests: Syntax; Functional; Structure (processor and memory resources);
- Ergonomics and navigation;
- Connection;
- Browser compatibility;
- Plug-in version required;
- Screen resolution.

CD-ROM
- Programming tests: Syntax; Functional; Structure (processor and memory resources);
- Ergonomics and navigation;
- Connection;
- Compatibility with different operating systems;
- Screen resolution.

Analysis and Interpretation
Data collected from the internal tests is entered into a worksheet and the results are analyzed. The aim is to understand and list the main problems of the system so that their resolution can be verified after debugging. The analysis also provides information from which the cause of each problem can be inferred.
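The environment factors listed above combine into a matrix of configurations to be checked. As a minimal sketch in Python, assuming hypothetical placeholder values for each factor (the deliverable does not name specific operating systems, browsers or resolutions) and illustrative helper names, the configurations and a worksheet of results could be represented as follows:

```python
from itertools import product

# Hypothetical factor values: placeholders only, not taken from the deliverable.
FACTORS = {
    "operating_system": ["os_a", "os_b"],
    "browser": ["browser_x", "browser_y"],
    "resolution": ["800x600", "1024x768", "1280x1024"],
}


def build_test_matrix(factors):
    """Return one dict per combination of factor values."""
    names = list(factors)
    combos = product(*(factors[name] for name in names))
    return [dict(zip(names, combo)) for combo in combos]


def record_result(worksheet, config, test_name, passed, note=""):
    """Append one worksheet row for a test run on a given configuration."""
    worksheet.append({**config, "test": test_name, "passed": passed, "note": note})


def open_problems(worksheet):
    """List the failed rows, i.e. the problems to re-verify after debugging."""
    return [row for row in worksheet if not row["passed"]]


matrix = build_test_matrix(FACTORS)
worksheet = []
record_result(worksheet, matrix[0], "navigation", True)
record_result(worksheet, matrix[1], "navigation", False, "menu not clickable")
print(len(matrix), "configurations;", len(open_problems(worksheet)), "open problem(s)")
```

Enumerating the full cross-product keeps untested combinations visible; in practice a team might prune it to the configurations actually found in participating schools.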

4.1.2.2 Experts’ Evaluation

Description
Invited software and usability experts perform this evaluation, which starts immediately after the internal tests. It is also performed in the laboratory, so that experts and developers can work together to improve the product. Each tester has a grid on which to note every item he or she considers relevant. This phase of evaluation and debugging is expected to last 4 weeks.

Application
Technological Evaluation
The expert evaluation involves people external to the project who have never seen the product and have no prior knowledge of it. The rationale for employing this group is that programmers and software designers sometimes tend to use the system in stereotypical ways. This approach is important for identifying bugs and for implementing modifications (buttons, explanations or animations) where necessary. Experts are invited to test the final product so that they can better identify technological problems in the software. In a first phase they evaluate on their own; in a second phase they are expected to compare findings and provide feedback on the interface. Over a period of between two weeks and one month, depending on the product’s size, the experts and the development team find and correct bugs until they consider the product finished and ready to use.

Heuristic Evaluation
Usability specialists judge whether each element of a user interface follows established usability principles, such as:
- Visibility of system status;
- Match between system and the real world;
- User control and freedom;
- Consistency and standards;
- Error prevention;
- Recognition rather than recall;
- Flexibility and efficiency of use;
- Aesthetic and minimalist design;
- Help users recognize, diagnose, and recover from errors;
- Help and documentation.

Evaluation Factors
- System navigation;
- Screen design and its ease of use, clarity and efficiency;
- Effectiveness of screen, help panels and error messages;
- Complexity of the keyboard and other input devices;
- Standard verification.

Analysis and Interpretation
Experts perform their evaluation by filling in a grid with the main factors considered (listed above). This data is entered into a worksheet listing the detected problems not covered by the internal tests, so that they can be verified after the second phase of debugging. The experts’ findings also help the team, which is already familiar with the project, to detect aspects that are not easily understandable or sequences that may be incorrect. With these results, improvements can be made to the project, adding value over the previous version.
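The merging of the experts’ grids into a worksheet of problems not already covered by the internal tests can be sketched as below. This is only an illustration under assumptions: the grid format (pairs of evaluation factor and problem description), the function name and all sample entries are hypothetical and are not taken from the VICTEC evaluation materials.

```python
from collections import defaultdict


def merge_expert_grids(grids, known_from_internal_tests):
    """Combine per-expert problem grids into one worksheet.

    grids: dict mapping expert name -> list of (factor, problem) tuples.
    known_from_internal_tests: set of (factor, problem) tuples already logged.
    Returns a dict mapping each new (factor, problem) to the sorted list of
    experts who reported it.
    """
    reporters = defaultdict(set)
    for expert, entries in grids.items():
        for entry in entries:
            if entry not in known_from_internal_tests:
                reporters[entry].add(expert)
    return {entry: sorted(names) for entry, names in reporters.items()}


# Invented example data, for illustration only.
grids = {
    "expert_1": [("System navigation", "no way back from help panel"),
                 ("Screen design", "low contrast on buttons")],
    "expert_2": [("System navigation", "no way back from help panel"),
                 ("Error messages", "message gives no recovery hint")],
}
internal = {("Screen design", "low contrast on buttons")}

worksheet = merge_expert_grids(grids, internal)
for (factor, problem), experts in sorted(worksheet.items()):
    print(f"{factor}: {problem} (reported by {', '.join(experts)})")
```

Recording which experts reported each problem supports the second phase, in which the experts compare their findings.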

4.1.2.3 Potential End Users (teachers, children and parents for the VICTEC project)

Description
As soon as developers and experts have evaluated the final product, potential users such as teachers, parents and children will also be able to test it. These tests are performed in the field, that is, directly in participating schools, integrated in the community environment. Users explore the program and perform several tasks directed by the experimenter, who takes notes on the users’ interaction. At the end of the evaluation session, questionnaires are distributed to parents and teachers, and interviews are conducted to capture the children’s opinions. This phase is expected to last 2 weeks.

Application
Performance measurement – Direct Observation
Performance measurement is a technique in which the user performs predefined tasks and quantitative data is obtained. The technique is expected to yield quantifiable data, possibly together with some subjective, qualitative information. Users’ tasks have to be carefully planned so that the correct measures can be taken with respect to the intended usability goals. Different tasks are defined for different kinds of users (children, parents and teachers). The test environment is set up in participating schools and the pilot studies are performed. Just before the test starts, users are introduced to the test session: the purpose of the test is stated, as well as the content of the test and what the user is supposed to do. During the test session, the experimenter asks the user to perform some tasks and notes whether the user gets into difficulties (generally the experimenter should not help).

Questionnaire/Interviews
After the tests, users (teachers and parents) are asked to fill in a questionnaire, and children answer some questions asked by the experimenter. This allows users to be queried about their experiences, problems and preferences. The user is also asked for comments about the system and for any suggestions he or she may have for improvement.

Evaluation Factors
In this evaluation phase several technical and usability factors will be evaluated. Examples of important factors are given below.

Technical Factors
- Which operating system is the user using?
- Which hardware is available on the user’s computer?
- Which display resolution is he/she using?
- Which browser is he/she using?
- Which plug-ins are available on the user’s computer?
- Which type of Internet access does the user have?

Usability Factors
- Are the words, phrases and concepts used familiar to the user?
- Is information presented in a simple, natural and logical order?
- Does the user find the marked exits easily (when the user finds themselves somewhere unexpected)?
- Is there a consistent look and feel to the system interface?
- Do users understand the colour and style conventions used for links (and for no other text)?
- Is the design simple, intuitive, easy to learn and pleasing?
- Is it easy to remember how to use it?
- Are any graphics or objects irrelevant, or do they contain unnecessary and distracting information?
- Is navigational feedback provided (e.g. showing a user's current and initial states, where they've been and what options they have for where to go)?
- Is the distance between targets (e.g. icons) and the size of targets appropriate (size should be proportional to distance)?
- Is the website fun to use?

Analysis and Interpretation
The Technological and Usability Evaluation is expected to yield data on the user interaction with the system and on the system behaviour in different environments. Data collected in the end user tests will be given priority in the analysis of the final product’s performance. These data will come from the questionnaires distributed to teachers and parents and from the experimenter’s observations during the end users’ interaction. For each user, a grid with the different variables and metrics will be filled in. After these data have been collected, an analysis will be performed to understand the product’s performance and identify improvement needs. A statistical treatment will be carried out to reach more detailed conclusions.
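As a rough illustration of the per-user grid and its first-pass summary, the sketch below records one observation per user and task and aggregates them by user group. The field names, the choice of metrics (completion, time, number of difficulties noted) and the sample values are assumptions made for this example; the deliverable does not specify the variables of the grid.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Observation:
    user_group: str      # "children", "parents" or "teachers"
    task: str            # hypothetical task label
    completed: bool      # did the user finish the task?
    seconds: float       # time taken
    difficulties: int    # number of difficulties noted by the experimenter


def summarise(observations):
    """Return count, completion rate and mean values per user group."""
    groups = {}
    for obs in observations:
        groups.setdefault(obs.user_group, []).append(obs)
    return {
        group: {
            "n": len(rows),
            "completion_rate": sum(r.completed for r in rows) / len(rows),
            "mean_seconds": mean(r.seconds for r in rows),
            "mean_difficulties": mean(r.difficulties for r in rows),
        }
        for group, rows in groups.items()
    }


# Invented observations, for illustration only.
grid = [
    Observation("children", "start_episode", True, 40.0, 0),
    Observation("children", "start_episode", False, 80.0, 2),
    Observation("teachers", "start_episode", True, 30.0, 0),
]
summary = summarise(grid)
print(summary)
```

Descriptive summaries of this kind would normally precede any inferential statistics, which depend on the final sample sizes and variable types.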

4.2 Usability Evaluation

4.2.1 Objectives and Design

A participatory design approach (Gould, Boies, & Clayton, 1991) is being used to guide the VICTEC development process. This requires early, regular and continuous input from users throughout the design and implementation phases, thus allowing users to make a significant contribution to the final product. Within the Usability Evaluation of the Demonstrator we are seeking to determine whether users are able to interact effectively, efficiently and enjoyably with the Demonstrator. A significant problem for VICTEC is the lack of a stable Demonstrator; however, the various elements of the VICTEC Demonstrator (virtual environment, agents, interaction mechanisms and components) have been assessed by transforming them into a high fidelity prototype. This consists of a 3D environment populated by animated characters that have a number of props to use in a bullying scenario. The characters’ emotional behaviour is conveyed through predefined animations and pre-recorded audio (using actors’ voices). The characters’ facial expressions are produced with a relatively simple approach in which dynamically modified textures express each character’s current emotional state.

4.2.2 Description of the Diagnostic Instruments of the Usability Evaluation

4.2.2.1 Trailer Approach and Questionnaire

Description
The VICTEC trailer is an interactive demonstration of a single episode of a bullying scenario using the Demonstrator. The trailer was presented at a seminar discussing bullying and was followed by the completion of a questionnaire. The script of the trailer is based on a single episode from a detailed scenario about bullying and victimisation behaviour, devised by the Psychology team with the assistance of experts in the field. Scripted camera movements and text changes contribute to the display of the events depicted in the trailer episode. The user can interact with certain elements of the Demonstrator in the fulfilment of the scenario. The interaction is somewhat restricted: only limited interaction with the various animated characters is possible, using a directed point-and-click approach with options provided for users.

The questionnaire used to evaluate the trailer was divided into 8 main sections and predominantly used a 5-point Likert scale (except for section 1, which used forced choice):
- Section 1: This explored whether people had a preference for cartoon or realistic characters and which character out of the 3 displayed they preferred (Luke the bully, John the victim, or Martina the bystander). Respondents were asked to rate according to 1 = most preferred, 2 = OK and 3 = least preferred.
- Section 2: This examined a) the characters’ voices for believability (believable to unbelievable) and likeableness (likeable to unpleasant), and b) the content of conversation used in terms of believability (believable to unbelievable), interest levels (interesting to boring) and how true to life the characters were (true to life to false).
- Section 3: This addressed character movements in the Demonstrator for believability (believable to unbelievable), realism (realistic to unrealistic) and smoothness (smooth to jerky).
- Section 4: The nature of the school environment displayed was assessed in terms of attractiveness (attractive to unattractive) and whether it matched the appearance of the characters (matches characters to does not match characters).
- Section 5: The bullying storyline was examined for believability (believable to unbelievable) and length (right length to too long).
- Section 6: This enquired about users’ feelings after the demo. Users were shown 5 faces ranging from very happy to very sad and were asked to state the reasons for feeling that way.
- Section 7: Usability concerns with the demo were dealt with here in terms of ease of use (easy to use to hard to use), how fun the demo was (fun to boring) and how fast the demo was (fast to slow).
- Section 8: Unstructured open-ended questions were asked pertaining to ‘what the best thing about the demo was’, ‘what the worst thing about the demo was’ and ‘whether there were any further comments about the demo’.

The results of the questionnaire were analysed by examining frequency distributions (as histograms) for the questions that employed Likert scales. Chi-square tests on cross-tabulations were calculated to determine relationships between different variables.

Application
76 questionnaires were completed. Of the respondents, 19 were male (25.3%) and 55 were female (73.3%). The age of the sample ranged from 10 to 55 with a mean age of 33.83 (SD: 14.98). The sample consisted predominantly of teachers from a wide range of primary and secondary schools in the South of England, although some children and parents did complete the questionnaire.
The questionnaire was administered during a ChildLine conference held at the Business centre, Islington, London, England on 25th March 2003. A seminar on the nature of bullying behaviour in schools was given to a diverse audience, followed by the Trailer. After the Trailer was shown, the questionnaire was explained to the audience and each delegate was asked to complete a questionnaire.

Evaluation Factors
The quality of the interaction was assessed in terms of agent physical appearance, environment appearance and effectiveness of the Demonstrator in terms of comprehensibility and believability of the story.

Analysis and Interpretation
The use of the trailer approach allowed us to obtain the input of a range of secondary and tertiary users, typically teachers and employees from educational support agencies. Although there were some children within the target VICTEC age group (8-12), they represented only a small part of the sample. To facilitate analysis, the participants were categorised into three age groups: 10-18, 19-40 and 41+. No gender differences were found in the sample, although there were differences related to age.

The overall comprehension of the story was good, with only 11% selecting either of the incorrect responses. Most of the respondents found the script of the story believable, and the content of the storyline was found to be highly interesting and true to life, with an overall acceptability rating of 73%. This suggests that the scenarios being used and the way in which they are being displayed are effective for users. The overall appearance of the Demonstrator (i.e. classrooms, desks, etc.) was found to be attractive and well matched to the characters, suggesting that the approach being taken to the physical appearance of the Demonstrator is appropriate.

The agents’ physical aspects received poor ratings in general, with most users finding the movements unbelievable, jerky and unrealistic; a similar view was held of the voices. The believability, likeability and so on of the physical aspects of the characters appear to have limited impact on story comprehension and believability, with only a small trend observed for voice acceptability and story believability. The believability of the story line or plot is strongly related to both the attractiveness of the visual environment and its match with the characters, suggesting a need for coherence and consistency with real-world situations.

This evaluation of the VICTEC Demonstrator suggests that VICTEC provides a suitable mechanism for the exploration of bullying issues and that the approach taken so far is appropriate. The lack of importance of physical characteristics is an important discovery, as it will reduce the need to spend extensive effort on factors such as voice and movement.
The results also suggest empathy, with participants transposing their own feelings onto the characters whilst accepting the limitations of technology in relation to the characters’ physical aspects.
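The cross-tabulation analysis described for the trailer questionnaire can be illustrated with a minimal Python sketch. It is not the procedure used by the project team: the function names, the variable coding and the sample responses are hypothetical and only show how a chi-square statistic is obtained from a cross-tabulation of two categorical variables (here, age group and a dichotomised believability rating).

```python
from collections import Counter


def cross_tabulate(rows, cols):
    """Count co-occurrences of two categorical variables."""
    counts = Counter(zip(rows, cols))
    row_levels = sorted(set(rows))
    col_levels = sorted(set(cols))
    table = [[counts[(r, c)] for c in col_levels] for r in row_levels]
    return row_levels, col_levels, table


def chi_square(table):
    """Pearson chi-square statistic and degrees of freedom for a table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of the two variables.
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof


# Hypothetical responses for illustration only (NOT project data).
age_group = ["10-18", "10-18", "19-40", "19-40", "41+", "41+", "41+", "19-40"]
story = ["believable", "unbelievable", "believable", "believable",
         "believable", "unbelievable", "believable", "unbelievable"]

row_levels, col_levels, table = cross_tabulate(age_group, story)
stat, dof = chi_square(table)
print(row_levels, col_levels, table, round(stat, 3), dof)
```

The statistic would then be compared against the chi-square distribution with the given degrees of freedom. With expected cell counts as small as in this toy example, an exact test would normally be preferred.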

4.3 Pedagogical Evaluation

4.3.1 Objectives and Design

The Pedagogical Evaluation, carried out with complete school classes in the participating countries, mainly aims to measure the cognitive, behavioural and affective effects of the child users’ interaction with the Demonstrator. Cognitive effects relate to the question of whether the child users, when interacting with the Demonstrator, learn something about:

• bullying processes;
• psychological processes that play a role in bullying situations;
• strategies that help to avoid bullying.

The evaluation of the cognitive effects is the key part of the evaluation. The evaluation of the behavioural effects deals with the question of whether there are more, fewer or other types of bullying taking place in school classes after the interaction with the Demonstrator. Finally, two types of possible emotional effects are measured. Firstly, the measurement of users’ satisfaction within the Usability Evaluation provides information as to whether the interaction with the Demonstrator is a positive or negative emotional experience for the users. Secondly, we will measure effects of the Demonstrator interaction on cognitive and affective empathic reactions of the users.

In addition to the measurement of the effects of the interaction, the evaluation pursues the goal of strengthening teachers’ involvement. This is important because teachers will have to work with the final version of the Demonstrator in their classes, which will require a high degree of teacher cooperation, and because teachers may provide us with information on how to improve the Demonstrator and how to integrate it into school curricula.

The evaluation is designed as a pre- and posttest measurement with the child users’ interaction with the Demonstrator in between. The same four diagnostic instruments will be applied in pre- and posttest: Teacher Rating, Bullying Questionnaire, Empathy Questionnaire and Picture-Story (see below).
The evaluation will be conducted as a field study in the three participating countries (Portugal, UK and Germany) in four school classes each. To meet the requirements of the target group, the subjects will be aged eight to twelve and will be from rural as well as urban schools. Overall, approximately 300 pupils will participate in the Pedagogical Evaluation. The Pedagogical Evaluation is scheduled to commence in February 2004. A detailed timetable is shown in Appendix B (Table 2).

It has to be noted (and this is part of the VICTEC contract, Annex 1, Description of work) that the Pedagogical Evaluation must remain exploratory for the following reasons:

• Due to the nature of field studies, it is impossible to control all variables that may have an impact on the dependent variables. Nevertheless, we decided to carry out the study in the field and not in the laboratory because this provides us with information on the feasibility of the Demonstrator in the natural environment and gives us the opportunity to strengthen teachers’ involvement.
• The sample size of the evaluation is limited, which leads to small expected effect sizes:
  o Since the Demonstrator is a new development, we cannot be sure that it has the intended effects, which are an increased understanding of bullying processes and a decrease in bullying behaviour. Thus we have to carry out the evaluation with a limited number of pupils and be careful about possible negative effects.
  o The Pedagogical Evaluation has to be integrated into the curriculum of the participating teachers. Therefore, the time and the number of teachers participating are limited.
• No control groups are involved in the Pedagogical Evaluation.

4.3.2 Description of the Diagnostic Instruments of the Pedagogical Evaluation

The design of the four diagnostic instruments was guided by three objectives.

1. The combination of the instruments should cover the three effects of the interaction with the Demonstrator (cognitive, behavioural, and emotional).
2. The instruments should be simple to apply, which means they should be easy for children to understand and independent of complicated technical devices.
3. The instruments should be appropriate for use in group contexts (school classes). This is essential because, if we were to use instruments that could only be applied to one subject at a time, we would have to work with the subjects sequentially. In this case it would be impossible to avoid communication among the pupils between sessions.

4.3.2.1 Picture-Story (“Pupil Test” in D2.2.2, see Appendix D.1 on CD)

Description
Subjects receive a picture-story dealing with relational bullying (see Appendix D.1 on CD) and are asked to answer multiple-choice questions concerning the content. The Picture-Story shows two girls bullying another girl (alleging that she smells and never washes her hair). The bullied girl attempts to cope with the situation by being sarcastic. The success of her coping attempt remains unclear. The subjects have to answer questions concerning what happens in the story, what they think the protagonists feel, what they feel while perceiving the story and how bad they think the situation is.

Application
The Picture-Story and the questions to answer are given to the subjects in printed form. The instrument can be applied to entire school classes at any one time. The experimenter shows the Picture-Story on a slide and explains the story to the subjects without going into detail and without providing any interpretations. After that, the subjects are asked whether there is any confusion. If not, the questions and answers are read aloud and subjects are asked again if everything is clear to them. They are then asked to complete the questions. The Picture-Story application takes approximately 15 minutes.

Quality Factors
The objectivity of the instrument is high. The application and data collection are highly standardized, as all subjects receive the same questions and answer options (multiple-choice). The fact that the Picture-Story is presented by the experimenter represents a minor problem for the application objectivity: his/her task is not to go into detail and to avoid any interpretations.

It is not possible to obtain a measure of the reliability of the instrument, for two reasons. Firstly, no parallel form of the instrument exists, so it is not possible to calculate parallel test reliability. Secondly, since the instrument is constructed as a measure sensitive to changes, it does not make sense to calculate retest reliability.

Several indicators for the validity of the instrument exist, based on a preliminary study that was conducted with 94 German pupils (43 girls and 51 boys) aged 8 to 12 (mean age: 9.67; SD: 1.25). In addition to the Picture-Story, the children answered the Empathy Questionnaire (see Appendix D.3 on CD) and the Personality Questionnaire for Children developed by Seitz & Rausche (PFK, 1992).

Firstly, the interpretation of the Picture-Story is a difficult task for the subjects. The answers to the question “What are the two girls in the first picture doing?” are very diverse (see Figure 2). The fact that a certain number of subjects chose “recommend to wash her hair” shows that it is not all that easy to answer this question.

[Bar chart: frequency of each answer; answer options: Tease the girl, Hassle the girl, Recommend to wash her hair, Devastate the girl]

Figure 2: Answer to “What are the two girls doing in the first picture?” (n=94)

Secondly, subjects seem to be able to put themselves in the position of the protagonists of the story. The answers to the questions on how the protagonists might feel reveal that most of the subjects are able to answer this question plausibly (see Figure 3) and to distinguish between different emotions and protagonists.

[Bar chart: frequency of each emotion, shown separately for the series Perpetrators, Victim and Self; the emotion labels on the axis are not recoverable]

Figure 3: Answers to “How do the perpetrators/does the victim feel?” and “How do you feel when reading the story?” (n=94)

Finally, as expected, highly empathic subjects have a slightly different understanding of the story. For example, highly empathic children judge the bullying taking place in the story to be more serious than less empathic subjects do (see Table 3 in Appendix B). For the empathy measurement the Empathy Questionnaire (see Appendix D.3 on CD) was used.

Analysis and Interpretation
Quantitative and qualitative analyses will be carried out on the data. The qualitative analysis focuses on the plausibility of the answers and can be interpreted as evidence for cognitive and affective abilities (or “competencies”) of the subjects. The score for cognitive ability can be extracted from the answer to the question about what is happening in the story. The affective ability can be extracted from the answers to the questions about emotions. Both the appropriateness of the answers (rated by experts) and the number of appropriate answers are taken into account.

The comparison between pre- and posttest results requires a quantitative approach. It will provide information about the effects of the subjects’ interaction with the Demonstrator and about whether there are changes over time in terms of increased awareness of bullying behaviour and the strategies that can be used. This can be analysed statistically with a t-test. In addition, a change in the number of emotions correctly assigned to the protagonists shows whether the empathic abilities of the subjects have improved due to interaction with the Demonstrator. An increase in the number of appropriate emotions concerning the subjects themselves might be interpreted as an increase in their affective empathic abilities.
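As an illustration of the pre-/posttest comparison mentioned above, the following Python sketch computes a paired (dependent-samples) t statistic on the number of emotions each pupil assigns correctly. That a paired variant of the t-test would be used is an assumption, since the text only names “a t-test”; the function name and the scores are hypothetical.

```python
import math
import statistics


def paired_t(pre, post):
    """Paired t statistic and degrees of freedom for pre/post scores.

    Assumes the same pupils appear in the same order in both lists and
    that the differences are not all identical (otherwise the standard
    deviation is zero and the statistic is undefined).
    """
    if len(pre) != len(post):
        raise ValueError("pre and post must have the same length")
    diffs = [after - before for before, after in zip(pre, post)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation
    t = mean_diff / (sd_diff / math.sqrt(n))
    return t, n - 1


# Hypothetical counts of correctly assigned emotions (NOT project data).
pre_scores = [1, 2, 3, 4]
post_scores = [2, 4, 4, 6]

t, dof = paired_t(pre_scores, post_scores)
print(round(t, 3), dof)
```

A positive t value here indicates higher posttest scores. The value would be compared against the t distribution with n - 1 degrees of freedom.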

4.3.2.2 Teacher Rating (see Appendix D.2 on CD)

Description
There is evidence from the literature that aggressiveness and empathy are negatively correlated. In their meta-analysis on empathy and aggression, Miller & Eisenberg (1988) found a highly significant common correlation between self-reported empathy and aggression, with the negative relation being “…consistent across the entire range of ages of the samples, from middle childhood to adulthood” (Miller & Eisenberg, 1988, p.329). As for teacher-rated aggressive behaviour, Bryant (1982) reasoned that “…an increase in empathy […] is associated with a decrease of aggressive behaviour with classmates as judged by their teachers” (Bryant, 1982, p.421).

We plan to implement a teacher rating in order to complement the child self-completion Bullying Questionnaire. Teachers will be asked to rate pupils’ bullying status on a five-point scale from “very often” to “never”. Following the structure of the Bullying Questionnaire, there will be different scales assessing the different types of bullying, namely physical, verbal and relational bullying (see Appendix D.4 on CD). Teachers will rate the quality and quantity of victimization and bullying behaviour for every single pupil (see Appendix D.2 on CD).

Application
The rating is provided in printed form. Teachers are asked to think of the last month while rating. This ensures comparability with the Bullying Questionnaire, which covers the identical timeframe, and accounts for potential changes in pupils’ bullying behaviour between pre- and posttest investigations. Prior to the rating, teachers are “trained” in detail with regard to the concepts of physical, verbal and relational bullying and what exactly is meant by the end points of the scale such as “not much”. Pupils are rated in random order. As there are only six decisions to be made, the application should not take longer than 1-2 minutes per pupil, totalling 30-60 minutes for an average class of 30 children.
Quality Factors
With the Teacher Rating we can obtain valuable data from an adult perspective with the aid of an instrument which is easy to apply. Teachers can be viewed as experts in terms of children’s behaviour within the classroom. Nevertheless, there are some restrictions to the instrument’s objectivity (Schaefer, 1996). Teachers do not always detect bullying behaviour: firstly, they are not always present when bullying behaviour occurs (e.g. in the playground), and secondly, children tend to conceal bullying behaviour in order to avoid punishment. And even if teachers detect bullying behaviour, they cannot always pass as impartial judges due to their diverse and complex relationships to pupils. Still, it is this very perspective that allows us to compare the teacher’s perception of bullying processes to the children’s perspective. Keeping these constraints in mind, differences between the Bullying Questionnaire and the Teacher Rating are to be expected.

Besides, as far as the impact of the Demonstrator on bullying behaviour is concerned, the most valuable information to be extracted from the ratings will be the difference between pre- and posttest, which may not be too strongly affected by the absolute status of bullying / victimization ascribed to the single child compared to other children. The teacher’s specific relationship to a certain child is unlikely to change substantially between the two investigations. In any case, it would be a serious mistake to exclude teachers from the Demonstrator Evaluation, as they can provide an additional, expert perspective on social processes within the classroom. As Glow, Glow and Rump (1982) put it, “for the bulk of children’s behaviour problems, […] one is forced to rely upon the information given by significant adults about the child’s typical behaviour” (Glow, Glow & Rump, 1982, p.34).

Concerning the instrument’s reliability, retest reliability cannot be assessed, since the rating is intended to measure pre-/posttest differences. However, according to reliability coefficients reported in the literature, teacher ratings seem to be a reasonably reliable tool to apply. Maumary-Gremaud (2000) identifies coefficients for inter-rater agreement between r=.67 and r=.89 regarding the ratings of three teachers concerning the academic motivation, academic performance, adult relationships, conduct, personal maturity and social skills of 463 6th graders. Van den Bergh & Marcoen (1999) report significant correlations between Harter’s Self-Perception Profile for Children (SPPC) and Harter’s Teacher Rating Scale of Child’s Actual Behaviour (TRS) for almost all subscale scores in all gender and grade groups for a Dutch-speaking Belgian sample of 4th, 5th and 6th graders.
Minogue, Kingery & Murphy (1999) provide a compilation of different teacher ratings of violent behaviour in schools with internal consistency coefficients ranging from r=.73 to r=.98 plus one coefficient for inter-rater agreement between teachers of r=.72. And finally, Ledingham, Younger, Schwartzman & Bergeron (1982) compared the ratings of 801 1st, 4th and 6th graders with the ratings of 30 teachers in Montreal, using the Pupil Evaluation Inventory by Pekarik et al. (1976), which assesses children’s aggression, withdrawal and likability. They found the inter-rater agreement to be higher between peer ratings and teacher ratings than between self-ratings and teacher / peer ratings. Children with highly deviant scores in particular described themselves as less aggressive and withdrawn and more likable than did their teachers and peers. The highest inter-rater agreement was found for aggression ratings between peers and teachers (r=.65 to r=.83). The authors ascribe this to the high perceptual salience of aggressive behaviour. The inter-rater agreement was not affected by pupils’ growing cognitive maturity.

In terms of validity, Brown et al. (1996) compared a teacher rating of children’s aggression to children’s social status among peers. They calculated the correlation between children’s aggression rated with the Revised Teacher Rating Scale for Reactive and Proactive Aggression and negative peer status and found significant correlations for both proactive and reactive aggression (Brown et al., 1996, p. 478). Additionally, both factors were significantly correlated with in-school detentions for inappropriate behaviour, physical & verbal aggression, disrespectful behaviour and destroying property.

Analysis and Interpretation
Ratings prior to and after the interaction with the Demonstrator will be assessed. This means that the ratings can give useful information on whether, according to the teachers, the bullying behaviour of children has changed after the application. If the Demonstrator is to be accepted by teachers and school authorities, it is crucial to evaluate the teachers’ opinion on its potential. We hope to gain insight into whether there are differences between the bullying types in terms of benefiting from the interaction with the Demonstrator and, together with the data collected by the Bullying Questionnaire, whether these differences show in the children’s self-reports on bullying as well.

Furthermore, the teacher-rated bullying behaviour can be added up to a total score indicating aggressive behaviour against other pupils (as perceived by the teacher), which might then be associated with low self-reported empathy, as suggested by the literature cited above. Such a finding would further support the notion that the promotion of empathy can inhibit aggressive behaviour against others. And last but not least, together with the data collected by the Bullying Questionnaire, we might gain fruitful insights into the differences between teachers and children in perceiving and judging bullying behaviour and victimization (see Schaefer, 1996).

Despite the objectivity constraints, which also affect the interpretation of the collected data, we think that if the Demonstrator is to be accepted by schools, it is essential to get teachers involved in the evaluation. After all, it is they who need to be aware of the Demonstrator’s potential in reducing bullying behaviour.
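The total score and its association with self-reported empathy, as described above, could be computed along the following lines. This is a minimal sketch under stated assumptions: the numeric coding of the five-point scale (0 = never to 4 = very often), the dictionary keys, the function names and all values are hypothetical, and Pearson’s r is used only as one plausible measure of association.

```python
import math

# Assumed coding of the five-point scale (not specified in the report):
# 0 = never ... 4 = very often.
BULLYING_TYPES = ("physical", "verbal", "relational")


def total_bullying_score(rating):
    """Sum a teacher's bullying ratings across the three bullying types."""
    return sum(rating[bullying_type] for bullying_type in BULLYING_TYPES)


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant data")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical ratings and empathy total scores (NOT project data).
teacher_ratings = [
    {"physical": 0, "verbal": 1, "relational": 0},
    {"physical": 2, "verbal": 3, "relational": 1},
    {"physical": 4, "verbal": 4, "relational": 3},
    {"physical": 1, "verbal": 0, "relational": 1},
]
empathy_scores = [210, 180, 150, 200]

totals = [total_bullying_score(rating) for rating in teacher_ratings]
r = pearson_r(totals, empathy_scores)
print(totals, round(r, 3))
```

A negative r in real data would be in line with the literature cited above. With ordinal ratings, a rank correlation might be the more defensible choice.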

4.3.2.3 Bullying Questionnaire (Appendix D.4 on CD)

Description
The Bullying Questionnaire is a self-report assessment to determine information about the prevalence of bullying behaviour at school and among siblings. It enables both sociometric status (whether children are popular, average, rejected, neglected or controversial) and bullying status (direct and relational bullies, victims, bully/victims and neutral) to be determined. The following data is collected:

• Information about social relationships in the school class (“pupils you like/dislike”)
• Siblings of the subject
• Hobbies of the subject
• Self-assessment of the physical strength of the subject
• Bullying in school (type, names of bullies, victims, role of the subject).

The Bullying Questionnaire is also part of the Psychological Evaluation. Its application procedure and quality factors are described in that section.

Application
See Psychological Evaluation.

Quality Factors
See Psychological Evaluation.

Analysis and Interpretation
In the first instance it is important to compare the results of the Bullying Questionnaire to the results of the Teacher Rating. This will show whether there are great differences in the perception of bullying between teachers and pupils. Research by Schaefer (1996) suggests that the teachers’ perception of bullying processes is very different from the pupils’ perception. If we find these differences as well, the question arises of how to deal with them. One may suppose that the pupils’ perception is more valid because they experience bullying themselves. Furthermore, perpetrators may try to conceal their bullying from the teachers, a strategy that could be successful especially for relational bullying. So, if we find differences in the perception of bullying processes between teachers and pupils, the information provided by the (pupils’) Bullying Questionnaire should be weighted more strongly than the data gathered in the Teacher Rating. Nevertheless, it is very important that the interaction with the Demonstrator shows effects in the Teacher Rating as well, because the teachers will decide on the application of the Demonstrator in their class.

The analysis of the Bullying Questionnaire focuses mainly on the pre-/posttest comparison of the frequencies of the bullying profiles (bullies and victims) for direct and relational bullying (see Table 4 in Appendix B).

A detailed picture of the changes in bullying processes due to the interaction with the Demonstrator will emerge. For the statistical analysis t-tests, Wilcoxon tests and χ² tests will be applied. This allows us to investigate whether there are effects of the Demonstrator interaction and what these effects are (e.g. an effect only for relational, but not for direct bullying).

Furthermore, we will investigate whether certain effects of the interaction depend on other variables, such as the subject’s level of empathy. We will investigate whether highly empathic subjects profit more or less from the interaction. It would be plausible to assume that highly empathic subjects profit more from the interaction, because they are able to put themselves in the position of the virtual protagonists of the bullying stories better than less empathic subjects; on the other hand, highly empathic subjects bully much less even in the pretest, so that the effects of the interaction could be weak due to a base-rate phenomenon. To analyse this we form sub-groups of subjects with high and low empathy scores respectively.

It is important to consider whether the effects of the interaction with the Demonstrator depend on the order of events. Since the story-telling of the Demonstrator is based on emergent narrative, the order of events is not predictable. This means that the interaction differs for each subject. If we find, for example, that all children who experienced an interaction with a virtual teacher had a much better understanding of bullying processes, it might make sense to install an obligatory interaction with a teacher. To investigate this, sub-groups of subjects are established according to their interaction with the Demonstrator. Then we will analyse whether there are differing effects that show in the data gathered with the Bullying Questionnaire.
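The sub-group comparison by empathy level can be sketched as follows. The report does not state how the high and low empathy groups are to be formed; the median split used here is an assumption, as are the field names, the function names and the data. The sketch only shows the mechanics of forming the two groups and comparing their mean pre-/posttest change.

```python
import statistics


def split_by_empathy(pupils):
    """Median split (an assumed rule): low <= median < high."""
    cutoff = statistics.median([p["empathy"] for p in pupils])
    low = [p for p in pupils if p["empathy"] <= cutoff]
    high = [p for p in pupils if p["empathy"] > cutoff]
    return low, high


def mean_change(group):
    """Mean posttest-minus-pretest difference in a bullying score."""
    return statistics.mean([p["post"] - p["pre"] for p in group])


# Hypothetical pupils (NOT project data); "pre" and "post" stand for any
# bullying score taken from the questionnaire before and after interaction.
pupils = [
    {"empathy": 10, "pre": 3, "post": 3},
    {"empathy": 20, "pre": 4, "post": 2},
    {"empathy": 30, "pre": 2, "post": 1},
    {"empathy": 40, "pre": 5, "post": 2},
]

low, high = split_by_empathy(pupils)
print(mean_change(low), mean_change(high))
```

The same grouping step could be reused for sub-groups defined by the events a pupil encountered in the Demonstrator, with the group difference then tested by one of the procedures named above.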

4.3.2.4 Empathy Questionnaire (see Appendix D.3 on CD)

Description
Empathy: A short review of the approaches to defining empathy is provided here (see Deliverable 2.2.2 for further details). We agree with the view posited by Preston (2001) that empathy is “any process where the attended perception of the object's state generates a state in the subject that is more applicable to the object's state or situation than to the subject's own prior state or situation” (p.4). This subsumes either understanding cognitively what the object feels (cognitive empathy) or actually feeling something due to the perception of the object’s emotion (affective empathy), or both. Today, most empathy researchers agree that empathy comprises these two major dimensions. Beyond this distinction, we want to differentiate between two ways of mediation: empathic processes can be mediated via situational cues or via expressional cues (facial expression, gesture, posture, paraverbal parameters such as voice pitch and speech rate, and physiological parameters such as flushing).

Relevance of Empathy Measurement for VICTEC: With empathy being a key concept within the project, not only the empathy of virtual agents but also the empathic abilities of child users are of concern. These empathic abilities of users should increase through interaction with the Demonstrator, as the users gain greater insight into the social and psychological processes that are involved in bullying. To be able to assess these changes, the Empathy Questionnaire has to be sensitive to changes. Furthermore, the aim of the empathy measurement within VICTEC is to investigate whether interrelations exist between empathic abilities and bullying type. For example, it might be possible to corroborate previous research (Sutton, Smith & Swettenham, 1999) suggesting that pure bullies are rather skilled in terms of cognitive empathy while lacking affective empathic abilities.
Structure of the Empathy Questionnaire: The combination of the two characteristics “empathic quality” (cognitive, affective) and “way of mediation” (via situation, via expression) results in four subscales, plus one scale assessing ideomotoric empathy. The latter concept extends empathic processes to movements: perceiving another person’s movements can trigger a tendency within the observer to perform that movement themselves. We think that this approach to empathy will provide new insights, especially within the bullying context. For a detailed overview with examples of items, see Table 5 in Appendix B.

The current version of the Empathy Questionnaire consists of 51 items, 22 of which have been adopted from Bryant’s (1982) Index of Empathy for Children and Adolescents. Bryant’s questionnaire is based on the Mehrabian & Epstein adult measure of emotional empathy (1972). The Bryant items have been supplemented by two items from Leibetseder (2001) and items developed by the VICTEC team. This was necessary to obtain items which cover all hypothesized aspects of empathy. All items have been translated into English, German and Portuguese. A five-point response format (“I strongly agree” to “I strongly disagree”) is used. There might be some modification of the questionnaire’s structure and item classification (as outlined in Table 5) after the validation is finished.

Application
The Empathy Questionnaire is available in printed form. The investigation is to be conducted by psychologists with the teachers present. The items are read aloud to the children one by one in order to avoid too much variance in processing time due to differences in reading performance among pupils. Children are explicitly invited to ask questions and comment on the items in order to single out items that children find hard to understand. Examples and explanations are also given whenever required. The experimenters also need to remind the children to consider which answer would be suitable for them without taking into account the teacher’s or the other children’s opinion on the item. Finally, it is crucial to emphasise explicitly that the answers will be treated confidentially.

Quality Factors
Objectivity: Objectivity of the instrument is ensured as the methodology is highly standardised. All subjects answer identical questions, and with the questions being read aloud, even the processing time for every item is standardised. In order not to violate these conditions, it is important for the experimenters to make sure that all subjects are ready to answer the next question and to inquire regularly whether there are any questions or confusion.

Reliability: There is some evidence for the reliability of the questionnaire from a preliminary study using the instrument on a sample of 234 German children aged 8 to 13 years (with only one child being 13), with a mean age of 9.62 years (SD: 1.07). The German sample consists of 109 males and 125 females. The internal consistency (Cronbach’s α) for the empathy total score is .84, which can be judged as highly satisfactory.
The coefficients for the empathy subscales are somewhat weaker (partly because the individual subscales are relatively short), but still acceptable (see Table 6 in Appendix B). Again, these coefficients might change once the validation of the questionnaire is completed.

Validity: All items are currently being judged by experts for their compatibility with the underlying concept of empathy. A software program has been developed to determine which of the five hypothesized categories suits each item. The expert validation is being carried out by a group of experienced research psychologists. All experts judge each item twice. First they assign the items to cognitive, affective or ideomotoric empathy, and in a second phase they sort all items again according to expressional or situational way of mediation. In both rounds there is an opportunity to opt for an “other category”. The item sequence is the same in both phases and for all expert judges. The results so far indicate that the majority of the 51 items can be clearly ascribed to one of the five empathy subscales. Some items show debatable expert classification, but as the validation is not yet completed it would be too soon to draw final conclusions. Concerning the empathy score’s association with other, already established constructs, we found a zero correlation with Need for Aggression and Opposition and a slightly negative correlation with Emotional Irritability, both measured by the Personality Questionnaire for Children developed by Seitz & Rausche (PFK, 1992). These findings, which are also based on the sample of 234 German children aged 8-13, were in line with expectations. Further indicators for the validity of the instrument can be derived from the correlations with the Picture Story. As already mentioned, a preliminary study with 94 German boys and girls (3rd to 5th grade) demonstrated a significant correlation between empathy (assessed with the Empathy Questionnaire) and the judgement of seriousness of a bullying situation in a Picture Story, indicating that children with higher (affective) empathic abilities show greater comprehension of ambiguous social situations (see 4.3.2.1). Since empathy can be regarded as a socially desired attribute, it is necessary to assess the children’s tendency to conform to such expectations. One item, “I sometimes have dirt on my shirt”, has been included to gauge the tendency to conceal the truth for the sake of socially desired self-presentation. In our study of 234 German pupils, only 6.8% of all pupils denied ever having dirt on their shirt. In a much smaller sample of UK pupils, not a single pupil disagreed with the item, either strongly or somewhat (see Figure 4).

[Figure 4 (chart not reproduced). Vertical axis: Percentage (0 to 60). Response categories: I strongly agree, I somewhat agree, I don't agree or disagree, I somewhat disagree, I strongly disagree. Groups: Germany (n=234), UK (n=31).]

Figure 4: Percentage of answers to the item “I sometimes have dirt on my shirt”

At this point, data from Portugal is still missing. Therefore, the efforts made so far to establish the instrument’s reliability and validity mainly focus on the German sample. Regarding UK data, we can only make cautious statements as the sample comprises only one class.
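To make the internal consistency coefficient reported under Reliability more concrete, the following is a minimal sketch of how Cronbach’s α could be computed from item-level responses. The function name and the tiny data set are hypothetical illustrations and are not taken from the VICTEC data; the real questionnaire has 51 items on a five-point scale.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a list of respondents.

    scores: list of rows, one row per child, each row holding that child's
    item scores (all rows must have the same length, at least two items).
    """
    n_items = len(scores[0])
    if n_items < 2 or any(len(row) != n_items for row in scores):
        raise ValueError("need at least two items and rows of equal length")
    # Variance of each single item across respondents
    item_vars = [variance([row[i] for row in scores]) for i in range(n_items)]
    # Variance of the total (sum) score across respondents
    total_var = variance([sum(row) for row in scores])
    return (n_items / (n_items - 1)) * (1 - sum(item_vars) / total_var)


# Hypothetical responses of four children to three items (1 to 5 scale)
example = [[4, 5, 4], [2, 2, 3], [5, 4, 5], [1, 2, 1]]
print(round(cronbach_alpha(example), 2))
```

The same function could be applied per subscale by passing only the columns belonging to that subscale; shorter subscales tend to yield lower coefficients, as noted above.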

Analysis and Interpretation

One major aim is to investigate whether the interaction with the Demonstrator has any impact on self-reported empathy. It can be hypothesised that the focused engagement in bullying scenarios provided by the Demonstrator enhances empathic abilities, leading to a better understanding of social processes such as bullying. The pre-/posttest design may also provide insight into differences between children scoring high or low on empathy. One can reasonably assume that children with high empathic abilities show greater learning effects than less empathic children, as they are able to use their empathic skills to comprehend the bullying situations and transfer the content to real-life situations far more easily than children who lack these skills. However, since some findings suggest that high empathy is associated with less aggressive behaviour (Miller & Eisenberg, 1988), interacting with the Demonstrator may show no effect on users with initially high levels of empathy (ceiling effect). Aside from this, the different aspects of empathy (operationalized by the different subscales) might be affected differently by the application of the Demonstrator. Because there is no comparable empathy questionnaire taking these distinctions into account, the results of the evaluation are of general interest at this point.

Secondly, the interrelation between bullying type and self-reported empathy is to be analysed. As already mentioned, there is the notion that some bullying types are characterized by relatively high levels of cognitive empathy while lacking affective empathic abilities (Sutton, Smith & Swettenham, 1999; Arsenio & Lemerise, 2001), a combination which leads them to become experienced tyrants who know exactly which weapon to choose in order to inflict maximum damage on a victim. Sutton, Smith & Swettenham (1999) state a “need for further research into cognitive skills and emotion understanding in children who bully” (p.435) with important implications for anti-bullying strategies. As there are no empirical studies available in this area to date, this is a major psychological question. Because the empathy subscales operationalize cognitive and affective empathy separately, the evaluation can contribute to meeting this need.
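The pre-/posttest comparison of self-reported empathy described above could be examined, for instance, with a paired comparison of each child’s two scores. The sketch below is only an illustration under that assumption: the document does not prescribe a specific test, and the function name and scores are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t(pre, post):
    """Mean change and paired t statistic for matched pre/post scores.

    Returns (mean_difference, t, degrees_of_freedom). Assumes at least two
    children and that the differences are not all identical.
    """
    if len(pre) != len(post) or len(pre) < 2:
        raise ValueError("pre and post must be matched lists of length >= 2")
    diffs = [after - before for before, after in zip(pre, post)]
    standard_error = stdev(diffs) / sqrt(len(diffs))
    return mean(diffs), mean(diffs) / standard_error, len(diffs) - 1


# Hypothetical empathy total scores of five children before and after
pre_scores = [150, 162, 171, 140, 158]
post_scores = [155, 165, 170, 149, 163]
change, t_value, dof = paired_t(pre_scores, post_scores)
print(change, round(t_value, 2), dof)
```

Running the same comparison separately for children with high and low initial scores, or for each subscale, would address the ceiling-effect and subscale questions raised in this section.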

4.4 Psychological Evaluation

4.4.1 Objectives and Design

There is substantial evidence within the psychological literature that bullying and victimisation are a worldwide problem with serious deleterious consequences (Wolke et al., 2001). Research studies are now concentrating on the individual differences of children involved in bullying behaviour in terms of those children categorised as ‘pure’ bullies, ‘pure’ victims, bully/victims and neutral children, who can be either bystanders in bullying incidents or defenders of the victim. However, little is still known about individual differences in cognitive styles (e.g. differences in theory of mind skills) among children involved in bullying behaviour, or about coping and planning skills in combating victimisation. (Please refer to Deliverable 2.2.2 for details about theory of mind.)

The major aim of the Psychological Evaluation is to determine whether user characteristics (roles in bullying behaviour) are reflected in user choices regarding the bullying scenarios in terms of mental representations (e.g. theory of mind), action choices, empathy and differences in coping strategies. Two major questions will be investigated:

1) Are there any differences in Theory of Mind responses for children classified as ‘pure’ bullies, ‘pure’ victims, bully/victims or neutral children for both direct and relational bullying behaviour?
2) a) Are there distinctions between the types of coping mechanisms selected by ‘pure’ bullies, ‘pure’ victims, bully/victims and neutral children when interacting with the bullying scenarios in the Demonstrator?
   b) Are there differences in terms of the justifications that children state for selecting coping mechanisms according to bullying status?
   c) Are there any differences in how the children empathise with the characters in the scenarios (empathy) according to bullying status?

In order to address the above research questions adequately, a large sample of children (N = 400) will be required to take part in the Psychology Evaluation. This is due to the skewed distribution of children across the categories ‘pure’ bullies, ‘pure’ victims, bully/victims and neutral children: more children in the sample will be classified as neutral and few will be classified as ‘pure’ bullies. A sample of 400 children will allow for adequate statistical analyses with the inclusion of all bullying groups.

The Psychology Evaluation will take place over a period of two weeks and each child will take part in the design protocol of the evaluation once. The facilities at the University of Hertfordshire will permit two classes to take part in the evaluation event at any one time, resulting in a highly standardised procedure with all children taking part at the same time. Initially, children will be required to complete the bullying questionnaire followed by a session with the bullying scenarios within the VICTEC Demonstrator. During the session with the Demonstrator, children will complete a number of justification questions regarding their choices for different types of coping strategies to deal with a variety of bullying situations. Following the Demonstrator session, children will complete a number of questions pertaining to emotion recognition, Theory of Mind and empathy.

4.4.1.1 Proposed Timetable of Events for the Psychology Evaluation hosted at the University of Hertfordshire, U.K. in June 2004

Dates for the Psychology Evaluation have been confirmed with the Computer Science department at the University of Hertfordshire for 14th – 25th June 2004. A preliminary timetable of events has been devised by the team at Hertfordshire as shown in Table 7 in Appendix C. The team decided that in order for schools to be interested and enthusiastic about the Psychology Evaluation, other events would have to take place for the children in addition to taking part in the evaluative work of the bullying scenarios within the VLE. These events are relevant to the project and are outlined in Table 8 in Appendix C. The extra activities will comprise 4 groups that children can participate in, but due to time constraints, children will have the opportunity to take part in 3 sessions lasting 30 minutes each. Children will be issued with coloured badges/letter badges assigning them to groups and will then follow the timetable outlined in Appendix C (Table 8).

4.4.1.2 Administration Details (e.g. Ethical Approval) for the Psychology Evaluation

Work has commenced on an outline of the budget to host the Psychology Evaluation and on other administrative details. Companies are being contacted regarding potential sponsorship for the event, and the External Relations Department at the University of Hertfordshire has been contacted to advertise and market the event. Ethical approval has been granted by the Department of Computer Science to host the event.

4.4.1.3 School Details regarding the Psychology Evaluation in 2004: Name for the Event, Advertising Leaflet, ChildLine Conference, School Contacts

The team has devised a name for the Psychology Evaluation, ‘Virtually Friends: Social role-play with robots and agents’, which will be used throughout the planning and launching of the event for information leaflets, presentations etc. A leaflet has been designed about the event and will be distributed to teachers interested in taking part. Contacts have been made to recruit schools for the evaluation; the head teacher at one of the schools is approaching the ‘head teacher consortium’ in Hertfordshire, which will greatly assist in the recruitment of schools.

Members of the VICTEC team attended a ChildLine conference day in London in March 2003, where we presented the VICTEC project and details of the proposed evaluation to various schools.

4.4.2 Description of the Diagnostic Instruments of the Psychological Evaluation

4.4.2.1 Evaluation of Direct and Relational Bullying Scenarios for the Psychology Evaluation (see Appendix D.5 on CD)

School visits have been made to evaluate the content and suitability of the initial bullying scenarios designed for the VICTEC Demonstrator and evaluation. Children aged 9-10 viewed one direct and one relational bullying scenario which were designed using Kar2ouche software (www.kar2ouche.com). The scenarios were viewed in a video format but did not have active speech. Speech and thought bubbles with text boxes were used instead. Subsequently, each child individually completed a questionnaire about the scenarios. Questions pertained to:

- Whether the bullying behaviour displayed in the scenarios was realistic and happened in the same manner at their school.
- What coping strategies the children would utilise to try and stop the bullying happening again in the future.
- Why the children thought the characters in the scenarios bullied or were victimised.
- Which character from the direct and relational scenarios was their favourite and which their least preferred, and the reasons why they selected a particular character.
- The emotions displayed in the scenarios: whether they felt sorry for any of the characters and whether any of the characters made them feel angry. Children were asked to provide an explanation for their selection.
- How the children felt overall after watching the scenarios, asked at the end of the questionnaire on a scale ranging from ‘very happy’ to ‘very sad’.

Approximately 80 children from 3 different schools in the U.K. and one class from a school in Portugal have completed the scenario evaluation questionnaire. The German team is awaiting ethical consent from the schools to carry out the evaluation and will do so shortly. Overall results are currently being collated and analysed and will be reported in the near future. These results will provide significant input for the final design elements of the scenarios in the final version of the VICTEC VLE used during the Psychology Evaluation.

4.4.2.2 Bullying Questionnaire (see Appendix D.4 on CD)

Description

The Bullying Questionnaire to be used as part of the Psychology and Pedagogical Evaluations is available in English, German and Portuguese and is described as part of the Pedagogical Evaluation.

Application

The Bullying Questionnaire is an essential assessment instrument for the Psychology and Pedagogical Evaluations as it permits the classification of children, for both direct and relational bullying, as ‘pure’ bullies, ‘pure’ victims, bully/victims (those children who both bully and are victimised at other times) and neutral children (bystanders, defenders). The questionnaire will be given to children to complete before they take part in the VICTEC Demonstrator session with scenarios. The researcher will explain each question on the questionnaire to the class as a whole, and each child will complete it individually in their own time. If they have any questions, they will be allowed to ask a researcher on hand. Completion of the Bullying Questionnaire takes approximately 15 minutes.

Quality Factors

The Bullying Questionnaire is based on the Olweus (1993) bullying questionnaire, which has been widely used to assess bullying behaviour in a wide range of countries. The items on the questionnaire were devised by experts in the bullying field, indicating high content validity. The bullying questionnaire also demonstrates high concurrent validity, as bullying roles in terms of bullies, victims, bully/victims and neutrals correlate with children’s sociometric status in terms of being rejected, neglected, controversial, average or popular (Wolke & Stanford, 1999). The questionnaire has also revealed high predictive validity in determining the stability of bullying roles over a period of several years (Wolke, Woods & Samara, in submission). However, it is necessary to pilot the current version of the Bullying Questionnaire. This has been carried out with one class from a school in Hertfordshire, U.K. and one class in Portugal. Once ethical permission is granted in Germany, the questionnaire will also be piloted there.
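The four-way classification used throughout the Psychology Evaluation can be sketched as a simple decision rule, applied separately to direct and relational bullying. This is an illustration only: the flags `bullies_others` and `is_victimised` are hypothetical inputs, and how they are derived from the questionnaire answers (for example, which frequency cut-off is used) is not specified in this section.

```python
def bullying_role(bullies_others: bool, is_victimised: bool) -> str:
    """Map two yes/no flags onto the four roles named in the text."""
    if bullies_others and is_victimised:
        return "bully/victim"
    if bullies_others:
        return "pure bully"
    if is_victimised:
        return "pure victim"
    return "neutral"


def classify_child(direct, relational):
    """Classify one child for both bullying forms.

    direct and relational are (bullies_others, is_victimised) pairs;
    this input layout is an assumption made for the example.
    """
    return {
        "direct": bullying_role(*direct),
        "relational": bullying_role(*relational),
    }


# Hypothetical child: bullies directly, is victimised relationally
print(classify_child(direct=(True, False), relational=(False, True)))
```

Keeping the two bullying forms separate allows a child to hold different roles for direct and relational bullying, which matches the distinction drawn in the research questions.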

As a result of the pilot study, some minor amendments to the questionnaire are necessary.

Comments from the pilot carried out in the U.K.

- Who the child lives with at home: Feedback from the children indicated that some of them struggled with question no. 1, ‘who do you live at home with’. It has been decided to provide categories that they tick rather than having them write who they live with in the allocated space.
- Friendship questions (liking and disliking): Most children provided competent responses to the questions about friendship within their class. This required them to write the ID number of the child from a class sheet containing all the names of the children in the class. Only a minority of children occasionally forgot to put the ID numbers in the allocated space.
- Questions about physical bullying: It needs to be made clearer in this section that children are required to put the ID numbers of children who they have selected as being involved in physical bullying. A box will be added to the questionnaire for them to fill in the ID numbers of children to make it more obvious. Most children understood questions in this section regarding the identification of ‘pure’ bullies and victims in the class, but many struggled with the bully/victim distinction. The decision has been made to make the introduction about physical bullying much clearer with more pictorial illustrations, and the section on bully/victims will be omitted.
- Questions about relational bullying: This section caused the most confusion for children. Many of them found it too repetitive and thought that it was asking exactly the same as for physical bullying, highlighting that we need to make the distinction between direct/physical and relational bullying clearer. Once again, children did not grasp the concept of bully/victims, and many children left this section completely blank as they thought it was repeated.
- Questions about sibling bullying: Most children completed this well. A sentence needs to be added informing children that they only need to complete this section if they have brothers and sisters.
- Questions about strength: Most children understood this section about rating whether they thought they were stronger or weaker than the other boys and girls in their class, although some children left it blank for the opposite gender.
- Questions about hobbies: Most children understood this section, although it should be made clearer that we are just referring to karate, judo etc. and not hobbies in general.
- Questions about computer games: All children managed this section. A tick box option will be added to the questionnaire to enable researchers to categorise whether the computer games are fighting games, adventure games, sport games, learning/educational games etc.

Comments from the bullying questionnaire pilot carried out in Portugal

- Friendship Questions (liking and disliking): Most children understood this section of the questionnaire.

- Sibling bullying: Some children answered the frequency bullied question even when they had answered no to experiencing all the types of bullying. They had some comprehension problems with this section. This will be modified to make it clearer.

Differences in bullying rates between the U.K. and Portugal

- More U.K. children stated that they were physically and relationally victimised than Portuguese children.
- Many more Portuguese children said that they physically and relationally bullied other children compared to U.K. children.
- Portuguese children stated that they bullied their siblings ‘a lot’ more than U.K. children.
- Portuguese children were more frequently bullied ‘a lot’ by their siblings compared to U.K. children.

Table 9 in Appendix C details differences between the UK and the Portuguese sample regarding the Bullying Questionnaire data.

Analysis and Interpretation

The data collected from the Bullying Questionnaire will permit quantitative statistics in the form of chi-square cross-tabulations. Regression analyses will also be conducted to determine predictors of bullying status. The Bullying Questionnaire will also provide useful insights concerning differences in interaction styles according to bullying profile, coping strategies used, justifications for using different strategies and differences in ToM abilities and emotion recognition.
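As an illustration of the chi-square cross-tabulations mentioned above, the following minimal sketch computes the Pearson chi-square statistic and its degrees of freedom for a table of observed counts. The function name and the counts are hypothetical and are not VICTEC results; a statistics package would normally be used in practice to obtain the p-value as well.

```python
def chi_square(table):
    """Pearson chi-square statistic for a cross-tabulation.

    table: list of rows of observed counts (e.g. bullying role by country).
    Returns (statistic, degrees_of_freedom). Assumes no row or column
    total is zero.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, dof


# Hypothetical counts: rows = victimised yes/no, columns = two countries
stat, dof = chi_square([[10, 20], [20, 10]])
print(round(stat, 2), dof)
```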

4.4.2.3 Theory of Mind Questions (ToM)

Description

Theory of mind refers to the ability to predict and explain the behaviour and feelings of others based on reference to mental states, beliefs, desires and percepts (Astington, 1993; Wellman, 1990). Theory of mind abilities and differences between bullying roles (bully, victim, bully/victim, neutral) is an area that has received little research attention and may be an important factor for explaining children’s involvement in bullying roles and the stability of these roles.

Content of ToM questions

1) Initial emotion questions: The initial emotion questions will refer to character emotions directly after the main bullying incident(s) illustrated by the storyboard. The characters will differ depending on the type of scenario. For example, in one scenario the main characters might be a victim, bully/victim and bystander, and in another scenario the main characters could be the bully, victim and assistant to the bully. Emotion questions will be based on the Ekman emotions (happiness, sadness, surprise, anger, fear, disgust) using facial representations (Ekman & Friesen, 1986). The child will be instructed to click on the face that they think represents the feelings of the character they are being asked about and will then be asked the question ‘which emotion do you mean?’ to clarify children’s interpretation of emotions.
   a) What do you think is happening in this scene? Children will be required to write their responses.
   b) How does Luke (bully) feel? Please click on the face.
   c) How does John (victim) feel? Please click on the face.
   d) How does Martina (bystander) feel? Please click on the face.
2) ToM questions: These questions will follow the initial emotion questions and refer to the main bullying incident(s). The questions will be formatted as follows:
   a) What does Luke (bully) think about John (victim)?
   b) What does John (victim) think about Luke (bully)?
   c) If you were John (victim), why do you think that Luke (bully) is doing this?
   d) If you were Luke (bully), why is he doing this to John (victim)?
3) End emotion questions: These questions will follow the ToM questions and will ask the child about the characters’ feelings at the end of the scenario (once coping styles have been tried etc). The procedure follows that of the initial emotion questions.
   a) How does John (victim) feel at the end of the story? Please click on a face.
   b) How does Luke (bully) feel at the end of the story? Please click on a face.

A timetable outlining the future work on the Psychological Evaluation can be found in the appendix (Table 11).

Application

Theory of mind questions will be asked as part of the Psychology Evaluation event when the child has finished interacting with a bullying scenario. They will be presented in the form of an electronic storyboard which will illustrate the main sequence of events that happened during the scenario. The child will then be asked a set of questions in electronic format about inferring the emotions, mental states and intentions of the main characters in the story. A design protocol for the format of ToM questions has been devised alongside a draft set of questions. Table 10 in Appendix C illustrates the design protocol for the ToM assessment with each child participating in the Psychology Evaluation. It will involve a 3*2 design: 3 bullying types (physical, verbal and relational) * 2 coping responses (successful/unsuccessful) presented in a randomised order.
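The 3*2 design with randomised presentation can be sketched as follows. This is a minimal illustration, assuming that every child sees all six combinations once and that the order is shuffled per child; the function name, the seeding scheme and the child identifier are hypothetical and are not taken from the design protocol in Table 10.

```python
import random
from itertools import product

BULLYING_TYPES = ("physical", "verbal", "relational")
COPING_OUTCOMES = ("successful", "unsuccessful")


def presentation_order(child_id, seed=0):
    """Return the six type-by-outcome conditions in a shuffled order.

    Seeding the generator with the child identifier makes the order
    reproducible for that child while differing between children.
    """
    conditions = list(product(BULLYING_TYPES, COPING_OUTCOMES))
    rng = random.Random(f"{seed}-{child_id}")
    rng.shuffle(conditions)
    return conditions


for condition in presentation_order(child_id="child-017"):
    print(condition)
```

A reproducible per-child order is one possible design choice; it makes it straightforward to reconstruct afterwards which scenario each answer referred to.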

Quality Factors

The ToM questions still require piloting in the U.K. in terms of child comprehension and validity in relation to other measures. The questions were devised by experts in the field and are related to previous first order and second order false belief questions used by Happe and Frith (1996).

Analysis and Interpretation

The majority of the emotion questions, with the exception of the first question, will permit the generation of frequency and percentage data. This will allow the team to analyse mistakes made by bullying groups (bully, victim, bully/victim and neutral) and discrepancies between selected emotions (button pressed) and what children verbally interpret the emotion as representing. The ToM questions will require children to write some text responses. These responses will then need to be content analysed in a form compatible with the software package NUDIST, which supports the identification of trends and common themes in qualitative data sets.
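The frequency and percentage data and the discrepancy check described above could be tabulated along the following lines. This is a sketch under assumptions: the function names are hypothetical, and each record is assumed to hold the emotion whose face was clicked together with the emotion label the child gave when asked what they meant.

```python
from collections import Counter


def emotion_percentages(choices):
    """Return {emotion: (count, percentage)} for a list of selected emotions."""
    counts = Counter(choices)
    total = len(choices)
    return {emotion: (n, 100.0 * n / total) for emotion, n in counts.items()}


def discrepancy_rate(records):
    """Share of records where the clicked face and the stated emotion differ.

    records: list of (clicked_emotion, stated_emotion) pairs.
    """
    mismatches = sum(1 for clicked, stated in records if clicked != stated)
    return mismatches / len(records)


# Hypothetical answers of four children to "How does John (victim) feel?"
clicked = ["sadness", "sadness", "anger", "fear"]
print(emotion_percentages(clicked))
print(discrepancy_rate([("sadness", "sadness"), ("anger", "sadness")]))
```

Computing these tables separately for each bullying group would give the group comparison the paragraph refers to.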

4.4.2.4 DANVA (Diagnostic Analysis of NonVerbal Accuracy)

Description

The DANVA (Nowicki and Duke, 1994) is a well established assessment instrument comprising a series of tests which assess individuals’ abilities in identifying the emotion being communicated in others’ facial expressions and tones of voice. One of the major strengths of the DANVA is that children can complete it in electronic format. The child facial expression test consists of 24 photographs equally distributed between high and low intensity expressions of four emotions: happy, sad, angry and fearful.

Application

The DANVA will form part of the Psychology Evaluation carried out before children take part in the VICTEC Demonstrator session. Each child will complete the DANVA individually following instruction from the researcher. Children’s responses to both the face and voice tests will be recorded electronically. Children will see photographs of people’s faces and will be asked to say what they think the person in the photo is feeling (happy, sad, angry or fearful). To help children, they will see the four emotion labels for each photo. The child sees each photo for 2 seconds.

Quality Factors

Tests have shown the DANVA to have high internal consistency (coefficient alphas above .70) and to be reliable over time (test-retest reliabilities between .70 and .80 over 6 to 8 week periods). Construct validity support is also evident from results of over 200 studies with participants aged from 3 to 80 years.

Analysis and Interpretation

The DANVA will provide useful insights regarding the emotion recognition abilities of children taking part in the Psychology Evaluation. The DANVA is scored for separate emotions by intensity. Error profile tables can be computed for each emotion, illustrating the number of errors made and the most frequent alternative emotion selected in place of the correct one. The DANVA has not previously been used to examine differences in bullying profiles. These data can be correlated with the ToM assessment, and differences in error rates can be determined for the different bullying profiles. For example, are there significant differences in DANVA emotion abilities for direct and relational bullying profiles? The DANVA will also allow the association between emotion recognition skills and the types of coping strategies selected to deal with bullying situations within the VICTEC Demonstrator to be examined. This will provide an overall picture and understanding of particular children who have social difficulties.
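An error profile table of the kind described above could be built as in the following sketch. It assumes that each trial is stored as a pair of the emotion shown and the emotion the child selected; the function name, the record layout and the example trials are hypothetical and do not reflect the DANVA’s own scoring software.

```python
from collections import Counter, defaultdict

EMOTIONS = ("happy", "sad", "angry", "fearful")


def error_profile(trials):
    """Per-emotion error count and most frequent wrong alternative.

    trials: iterable of (target_emotion, chosen_emotion) pairs.
    Emotions with no errors get a count of 0 and an alternative of None.
    """
    wrong_choices = defaultdict(Counter)
    for target, chosen in trials:
        if chosen != target:
            wrong_choices[target][chosen] += 1
    profile = {}
    for emotion in EMOTIONS:
        alternatives = wrong_choices[emotion]
        profile[emotion] = {
            "errors": sum(alternatives.values()),
            "most_frequent_alternative": (
                alternatives.most_common(1)[0][0] if alternatives else None
            ),
        }
    return profile


# Hypothetical responses of one child to five photographs
trials = [("happy", "happy"), ("sad", "angry"), ("sad", "angry"),
          ("sad", "fearful"), ("fearful", "sad")]
for emotion, row in error_profile(trials).items():
    print(emotion, row)
```

Aggregating such profiles over the children in each bullying group would allow the group comparisons of error rates mentioned in the text.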

Pilot testing for the entire Psychology Evaluation protocol will be conducted in early 2004 to ensure that the methodology, procedures and assessment instruments are all acceptable.

5 Conclusions

In this document we have described the various preparations regarding the evaluation of the two pieces of software which are to be delivered by the VICTEC project, the Toolkit and the Demonstrator. As was announced in Deliverable 2.2.2 (Evaluation Methodology), it focuses on the concrete evaluation tools that will be used. The evaluation of the Toolkit has already led to the identification of problematic features and potential solutions by experts and users, which are to be passed on to the developers. Regarding the evaluation of the Demonstrator, detailed information is provided on the development and on the quality factors of self-developed instruments, including statistical and qualitative analysis of the data collected within the evaluation process. Preliminary results from pilot studies support the reliability and validity of these instruments, suggesting that the tools developed so far are appropriate. Beyond this, there is evidence from the Trailer Approach that story comprehension and believability are generally high, with the content of the storyline being found highly interesting and true to life. The evaluation tools introduced are easy to apply and cover a broad variety of different sources (potential user groups & experts, children & teachers) and aspects (impact of the software on cognitive, behavioural, and emotional processes), promising high validity of the entire evaluation process. The integration of data from this variety of different evaluation sources / methods offers a broad perspective on technological, pedagogical, psychological, and usability aspects of the software products to be delivered by the VICTEC project. Finally, attention is paid to gathering information on the potential user groups’ acceptance of these products, which is an essential duty for the VICTEC team.

References:  Arsenio, W. F. & Lemerise, E.A. (2001). Varieties of Childhood Bullying: Values, Emotion Processes and Social Competence. Social Development, 10 (1), 59-74.  Astington, J. (1993). The child’s discovery of mind. Cambridge, MA: Harvard University Press.  Brown, K., Atkins, M. S., Osborne, M. L. & Milnamow, M. (1996). A Revised Teacher Rating Scale for Reactive and Proactive Aggression. Journal of Abnormal Child Psychology, 24 (4), 473-480.  Bryant, B. K. (1982). An Index of Empathy for Children and Adolescents. Child Development, 53 (1), 413-425.  Ekman, P. (1989). The argument and evidence about universals of facial expression. In H. Wagner & A. Manstead (Eds.), Handbook of social psychophysiology (pp. 143- 164). Chichester (UK): Wiley.  Glow, R. A., Glow, P. H. & Rump, E. E. (1982). The Stability of Child Behaviour Disorders: A One Year Test-Retest Study of Adelaide Versions of the Conners Teacher and Parent Rating Scales. Journal of Abnormal Child Psychology, 10 (1), 33- 60.  Gould, J. D., Boies, S. J., & Clayton, L. (1991). Making usable, useful productivity- enhancing computer applications. Communications of the ACM, 34(1), 75-85.  Happe, F., & Frith, U. (1996). Theory of mind and social impairment in children with conduct disorder. British Journal of Developmental Psychology, 14, 385-398.  Immersive Education (2001). Kar2ouche. Oxford.  International Standards Organisation (1998). Ergonomic requirements for office work with visual display terminals (VDTs) -- Part 11: Guidance on usability.  Ledingham, J. E., Younger, A., Schwartzman, A. & Bergeron, G. (1982). Agreement among Teacher, Peer, and Self-Ratings of Children’s Aggression, Withdrawal, and Likeability. Journal of Abnormal Child Psychology, 10 (3), 363-372.  Leibetseder, M., Laireiter, A.-R., Köller, T. (2001). E-SKala: Fragebogen zur Erfassung von Empathie – Beschreibung und psychometrische Eigenschaften. Zeitschrift für Differentielle und Diagnostische Psychologie, 1, 70-85. 
 Maumary-Gremaud, A. (2000). Teacher Rating of Student Adjustment (FAST Track Project Technical Report). http://www.fasttrackproject.org/techrept/t/tsa/tsa7tech.doc  Mehrabian, A. & Epstein, N. (1972). A measure of emotional empathy. Journal of Personality, 40, 525-543.

 Miller, P. A. & Eisenberg, N. (1988). The Relation of Empathy to Aggressive and Externalizing/Antisocial Behavior. Psychological Bulletin, 103(3), 324-344.  Minogue Nicholas P., Kingery P. M. Murphy, L. (1999). Approaches to Assessing Youth Violence. Washington, D.C.: The Hamilton Fish Institute. http://hamfish.org/pub/vio_app.pdf  Nielsen, J. (1992). Finding usability problems through heuristic evaluation. Paper presented at CHI'92, Monterey, CA, 373-380.  Nielsen, J. (1993). Usability Engineering, Academic Press Inc: London.  Nielsen, J. (1994a). Enhancing the Explanatory Power of Usability Heuristics. Paper presented at the CHI'94 Conference on Human Factors in Computing Systems, ACM Press, 152-158.  Nielsen, J. (1994b). Guerrilla HCI: Using Discount Usability Engineering to Penetrate the Intimidation Barrier. In R. G. Bias, Mayhew, D.J. (Ed.), Cost-Justifying Usability (pp. 245-272). London: Academic Press Inc.  Nielsen, J. (2002). Heuristic Evaluation. Available: http://www.useit.com/papers/heuristic  Nowicki, S. Jr., & Duke, M. P. (1994) Individual differences in nonverbal communication of effect: The Diagnostic Analysis of NonVerbal Accuracy Scale (DANVA). Journal of Nonverbal Behaviour, 18, 9-35.  Olweus, D. (1993). Bullying in schools: what we know and what we can do. Oxford: Blackwell Publishers.  Preston, S. D. & de Waal, F. B. M. (2001). Empathy: Its ultimate and proximate bases. Behavioral and Brain Sciences. Online 8/01: http://www.bbsonline.org/Preprints/Preston; print edition in press.  Rudd, J., Stern, K., & Isensee, S. (1996). Low vs. high-fidelity prototyping debate. Interactions, 3(1), 76-85.  Schaefer, M. (1996). Different Perspectives of bullying. Poster presented at the XIV Meetings of ISSBD, August, Quebec, Canada.  Seitz, W. & Rausche, A. (1976). Persönlichkeitsfragebogen für Kinder. Westermann: Braunschweig.  Snyder, C. (1996). Using Paper Prototypes to Manage Risk. Software Design and Publishing Magazine. 
 Sutton, J., Smith, P.K., & Swettenham, J. (1999). Social cognition and bullying: social inadequacy or skilled manipulation? British Journal of Developmental Psychology, 17 (3), 435-450.

Van den Bergh, B. & Marcoen, A. (1999). Harter's Self-Perception Profile for Children: Factor structure, reliability and convergent validity in a Dutch-speaking Belgian sample of fourth, fifth and sixth graders. Psychologica Belgica, 39 (1), 29-47.
Wellman, H. (1990). The child’s theory of mind. Cambridge, MA: MIT Press.
Wolke, D., & Stanford, K. (1999). Bullying in school children. In D. Messer & S. Millar (Eds.), Developmental Psychology. London: Arnold Publisher.
Wolke, D., Woods, S., Schulz, H., & Stanford, K. (2001). Bullying and victimisation of primary school children in South England and South Germany: Prevalence and school factors. British Journal of Psychology, 92, 673-696.

APPENDIX A- Technical Evaluation of the Demonstrator (Tables)

TOOLS    MONTH 1    MONTH 2
Internal Test
Final Internal Tests and debugging
Experts evaluation
Technological
Usability
Debugging (*)
Final users tests
Performance Measurement
Questionnaire/Interviews

Table 1: Evaluation Schedule for Technical Evaluation
(*) The time to accomplish this task depends on the errors found.

APPENDIX B- Pedagogical Evaluation of the Demonstrator (Tables)

Date             Action
February 2004    Pretest with the following diagnostic instruments:
                 - picture story
                 - Teacher Rating
                 - Bullying Questionnaire
                 - Empathy Questionnaire
March 2004       Application of Demonstrator in schools
April 2004       Posttest (same diagnostic instruments as in pretest)

Table 2: Timetable for Pedagogical Evaluation

Correlation of the answer to “How bad are the things taking place in the story?” with …
empathy (overall)     r = 0.40
cognitive empathy     r = 0.28
affective empathy     r = 0.40

Table 3: Correlation “How bad are the things taking place in the story?” with empathy (n=94)
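The coefficients in Table 3 can be illustrated with a minimal sketch, assuming that r denotes Pearson's product-moment correlation (the table does not name the coefficient). The function name `pearson_r` and the sample ratings below are hypothetical and are not project data.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length samples."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two samples of equal length (at least 2 values)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products of deviations from the means.
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Sums of squared deviations for each sample.
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cross / sqrt(ss_x * ss_y)


# Hypothetical example: severity ratings of the story and empathy scores
# for six children (invented values for illustration only).
severity = [3, 4, 2, 5, 4, 1]
empathy = [30, 41, 35, 44, 33, 29]
print(f"r = {pearson_r(severity, empathy):.2f}")
```

The same function would be applied separately to the overall, cognitive and affective empathy scores to obtain the three columns of the table.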

Direct Bullying
  - Number of pupils named as bullies, victims, bully/victims
  - Frequency information on subjects' behaviour (bullying, being bullied)
Relational Bullying
  - Number of pupils named as bullies, victims, bully/victims
  - Frequency information on subjects' behaviour (bullying, being bullied)
Direct Sibling Bullying
  - Frequency information on subjects' behaviour (bullying, being bullied)

Table 4: Bullying data for pre-/posttest-comparison from Bullying Questionnaire

Via situation
  Cognitive empathy: “I often try to understand my friends better by seeing things from their point of view” (Leibetseder; translation by author)
  Affective empathy: “It makes me sad to see a girl who can’t find anyone to play with” (Bryant)
Via expression
  Cognitive empathy: “I can tell what mood my father’s in by the look on his face” (Zoll)
  Affective empathy: “Even when I don’t know why someone is laughing, I laugh too” (Bryant)
Ideomotoric empathy: “When I see somebody dancing, it makes me want to dance too” (Zoll)

Table 5: Scales of the Empathy Questionnaire

Empathy Scale                              Cronbach’s α (Internal Consistency)
Empathy total score (51 items)             α = .84
Cognitive Empathy, situation-mediated      α = .66
Cognitive Empathy, expression-mediated     α = .58
Affective Empathy, situation-mediated      α = .63
Affective Empathy, expression-mediated     α = .65
Ideomotoric Empathy                        α = .57

Table 6: Reliability Coefficients (Internal Consistency) for Empathy Scales.
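As an illustration of how an internal-consistency coefficient of this kind is obtained, the following is a minimal sketch of the standard Cronbach's alpha formula, alpha = k/(k-1) * (1 - sum of item variances / variance of total scores). The function name `cronbach_alpha` and the response matrix are hypothetical; they are not the project's data or analysis script.

```python
from statistics import variance


def cronbach_alpha(item_scores):
    """Cronbach's alpha for a list of respondents, each a list of k item scores."""
    k = len(item_scores[0])
    if k < 2 or len(item_scores) < 2:
        raise ValueError("need at least two items and two respondents")
    # Sample variance (n - 1) of each item across respondents.
    item_vars = [variance([row[i] for row in item_scores]) for i in range(k)]
    # Sample variance of each respondent's total score.
    total_var = variance([sum(row) for row in item_scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


# Hypothetical responses: 5 children x 4 items on a 1-5 scale
# (invented values for illustration only).
responses = [
    [4, 5, 4, 5],
    [2, 3, 2, 2],
    [3, 3, 4, 3],
    [5, 4, 5, 5],
    [1, 2, 1, 2],
]
print(f"alpha = {cronbach_alpha(responses):.2f}")
```

In practice the function would be run once per scale, passing only the items that belong to that scale.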

APPENDIX C- Psychological Evaluation of the Demonstrator (Tables)

Time               Event

9.30 – 10.00 am    Introduction to the children about the day's events, and answers to any questions they have about the day

10.00 – 11.00      Psychology Evaluation using the computers. The children complete the bullying questionnaire and then interact with the VLE for approx. 30 mins, followed by a series of questions to answer at the end of the interaction. Need to verify what type of evaluation Lynne & Polly are hoping to do. (Children who have not received parental consent will be able to play a computer game while the rest of the children do the VLE interaction, so that they are not singled out.)

11.00 – 11.30 Refreshments

11.30 – 1.00       Robot demos: possible ideas are to show some of the LEGO robot demos, the Robocup simulation and Peke the robot. The idea is to run these as 3 parallel sessions lasting 35 mins each, with the children split into 3 groups that rotate through: 1) Bullying session 2) LEGO robot demo 3) Robocup simulation 4) Peke robot session

1.00 – 1.45 Lunch

1.45 – 2.00        Day ends after lunch

Table 7: Timetable of events for the Psychology Evaluation

               Bullying talk        Peke demo            Lego demo            Robocup
Red Group      1 (11.30 – 12.00)    2 (12.00 – 12.30)    3 (12.30 – 1.00)     FREE
Blue Group     2 (12.00 – 12.30)    FREE                 1 (11.30 – 12.00)    3 (12.30 – 1.00)
Green Group    3 (12.30 – 1.00)     1 (11.30 – 12.00)    FREE                 2 (12.00 – 12.30)

Table 8: Activity Groups for children to participate in after the evaluation

Gender
  U.K. data:     N: 23 male (74.2%) & N: 8 female (25.8%)
  Portugal data: N: 13 male (54.2%) & N: 11 female (45.8%)
Siblings
  U.K. data:     N: 29 (93.5%) had at least 1 sibling
  Portugal data: N: 17 (71%) had at least 1 sibling
Physically victimised
  U.K. data:     N: 22 (71.0%) said yes
  Portugal data: N: 16 (66.7%) said yes
Frequency physically victimised
  U.K. data:     N: 15 (68.2%) not much, N: 4 (18.2%) quite a lot (> 4 times over school term), N: 3 (13.6%) a lot (a few times every week).
  Portugal data: N: 8 (50.0%) not much, N: 8 (50%) quite a lot (> 4 times over school term), N: 0 (0%) a lot (a few times every week).
Physically bully others
  U.K. data:     N: 6 (19.4%) said yes
  Portugal data: N: 17 (70.1%) said yes
Frequency physically bully others
  U.K. data:     N: 6 (100%) not much
  Portugal data: N: 9 (53%) not much, N: 5 (29.4%) quite a lot, N: 3 (17.6%) a lot.
Relationally victimised
  U.K. data:     N: 22 (71%) said yes
  Portugal data: N: 12 (50%) said yes
Frequency relationally victimised
  U.K. data:     N: 12 (54.5%) not much, N: 6 (27.3%) quite a lot, N: 4 (18.2%) a lot.
  Portugal data: N: 6 (50%) not much, N: 3 (25%) quite a lot, N: 3 (25%) a lot.
Relationally bullying others
  U.K. data:     N: 8 (25.8%) said yes
  Portugal data: N: 13 (54.2%) said yes
Frequency relationally bully others
  U.K. data:     N: 6 (75%) not much, N: 2 (25%) quite a lot.
  Portugal data: N: 7 (53.8%) not much, N: 6 (46.2%) quite a lot.
Frequency bullied by siblings
  U.K. data:     N: 8 (27.6%) not much, N: 5 (17.2%) quite a lot, N: 12 (41.4%) a lot. (5 data sets missing)
  Portugal data: N: 5 (29.4%) not much, N: 2 (11.8%) quite a lot, N: 11 (58.2%) a lot.
Frequency bully siblings
  U.K. data:     N: 11 (37.9%) not much, N: 8 (27.6%) quite a lot, N: 4 (13.8%) a lot. (4 data sets missing)
  Portugal data: N: 3 (18.7%) not much, N: 6 (37.5%) quite a lot, N: 7 (43.8%) a lot.

Table 9: Summary of data from the Bullying questionnaire in the U.K. and Portugal

Predominantly unsuccessful coping response outcomes
  Physical bullying scenario:   Initial emotion questions; ToM questions; Post emotion questions
  Verbal bullying scenario:     Initial emotion questions; ToM questions; Post emotion questions
  Relational bullying scenario: Initial emotion questions; ToM questions; Post emotion questions
Predominantly successful coping response outcomes
  Physical bullying scenario:   Initial emotion questions; ToM questions; Post emotion questions
  Verbal bullying scenario:     Initial emotion questions; ToM questions; Post emotion questions
  Relational bullying scenario: Initial emotion questions; ToM questions; Post emotion questions

Table 10: Design protocol for ToM assessment as part of the Psychology Evaluation

DATE / EVENT

Jan 2003
  - Translation of the bullying questionnaire into German and Portuguese. Completed.
  - Ethics approval forms submitted to the UH board for Psychological Evaluation. Completed.
  - COMPLETED

Feb 2003 – Mar 2003
  - Pilot the bullying questionnaire in schools in the UK. Once this has been done, pilot studies will take place in Portugal and Germany. U.K. and Portugal pilot completed. Awaiting Germany.
  - Carry out necessary modifications.
  - COMPLETED in U.K. and Portugal.

Mar 2003 – Apr 2003
  - School visits to take place to evaluate the content of the bullying scenarios for the Psychological Evaluation. Storyboards for direct and relational scenarios initially devised have been evaluated in U.K. and Portugal; provide overview of main outcomes of this.
  - Carry out necessary modifications.
  - Bullying scenarios evaluated in U.K. and Portugal.
  - New scenarios currently being developed by children.

Apr 2003 – May 2003
  - Generation of Theory of Mind questions and justification questions to be used in the evaluation.
  - Work is ongoing with developing these questions.

May 2003 – Jun 2003
  - Theory of mind and justification questions to be piloted to ensure that the children understand them.
  - Carry out necessary modifications.
  - Scheduled to take place September 2003.

Jun 2003 onwards
  - Ideas to be proposed for the activities that the children will take part in during the day at UH in addition to taking part in the evaluation. Travel arrangements and possible sponsors to be considered.
  - Done; these are outlined above.

Jan 2004 onwards
  - Schools contacted in the Hertfordshire region to take part in the Psychological Evaluation. School visits carried out where necessary to encourage schools to take part. Contacts have been made.
  - Pedagogical Evaluation takes place. Information from this used to aid Psychological Evaluation.
  - School contacts are already being established.

Apr 2004 – Jun 2004
  - Technical equipment checked etc. for the running of the evaluation.
  - Volunteers recruited to help run the evaluation.

Jun 2004
  - Psychological Evaluation takes place at UH.

Jun 2004 onwards
  - Data collated and analysed.
  - Publications prepared.
  - Project feedback provided to schools.

Table 11: Timetable for future work
