DEGREE PROJECT IN COMPUTER SCIENCE, FIRST LEVEL
STOCKHOLM, SWEDEN 2015

Compiling attention datasets

DEVELOPING A METHOD FOR ANNOTATING FACE DATASETS WITH HUMAN PERFORMANCE ATTENTION LABELS USING CROWDSOURCING

DANIEL ROMULD AND MARKUS RUHMÉN

KTH ROYAL INSTITUTE OF TECHNOLOGY

CSC School
KTH Computer Science and Communication


Degree Project in Computer Science, DD143X
Supervisor: Richard Glassey
Examiner: Örjan Ekeberg

CSC, KTH
June 2, 2015

Sammanfattning

This essay addresses the problem of detecting human attention, which is a problem in computer vision. To make progress towards solving the problem, a method was developed for creating attention labels for datasets of face images. The labels constitute a measure of the perceived level of attention of the people in the images. The work in this essay is motivated by the lack of datasets with attention labels and by the potential usefulness of the developed method. The method was constructed with a focus on maximising the reliability and usability of the collected data and the resulting dataset. As a first step in the method development, images of crowds were generated using the Labeled Faces in the Wild dataset. This made it possible to evaluate the level of attention of the people in the images as individuals in a crowd. This property was evaluated by workers on the crowdsourcing platform CrowdFlower. The answers were analysed and combined to calculate a human performance attention measure for each individual in the images. The analysis of the results showed that the answers from the workers on CrowdFlower were reliable, with high internal consistency. The developed method was deemed to be a valid approach for creating attention labels. Possible improvements were identified in several parts of the method and are reported as part of the essay's main results.

Abstract

This essay expands on the problem of human attention detection in computer vision by providing a method for annotating existing face datasets with attention labels through the use of human intelligence. The work described in this essay is justified by the lack of human performance attention datasets and by the potential uses of the developed method. Several images of crowds were generated using the Labeled Faces in the Wild dataset of images depicting faces, which enabled evaluation of the level of attention of the depicted subjects as part of a crowd. The data collection methodology was carefully designed to maximise the reliability and usability of the resulting dataset. The crowd images were evaluated by workers on the crowdsourcing platform CrowdFlower, which yielded human performance attention labels. Analysis of the results showed that the submissions from workers on the crowdsourcing platform displayed a high level of consistency and reliability. Hence, the developed method, although not fully optimised, was deemed to be a valid process for creating a representation of human attention in a dataset.

Contents

1 Introduction 1
   1.1 Problem statement 1
   1.2 Method structure 2
   1.3 Projected outcome and contribution 3
   1.4 Overview 3
   1.5 Terminology 4

2 Background 4
   2.1 Face datasets 5
      2.1.1 Uses for datasets 5
      2.1.2 Compilation and design of datasets 7
   2.2 Data gathering 7
      2.2.1 Conventional methods 8
         2.2.1.1 College samples 8
         2.2.1.2 Web surveys 8
      2.2.2 Crowdsourcing 9
         2.2.2.1 CrowdFlower 9
         2.2.2.2 Amazon’s Mechanical Turk 10
         2.2.2.3 Ethical concerns of using crowdsourcing 11
   2.3 Summary 11

3 Method 11
   3.1 Generating images 11
      3.1.1 Why LFW 12
      3.1.2 Method of choosing members 13
      3.1.3 Background removal 13
      3.1.4 Actual image generation 14
      3.1.5 Design details 15
      3.1.6 Generated files 16
      3.1.7 Software used 17
   3.2 Collecting data 17
      3.2.1 Beta test 18
         3.2.1.1 Question 19
         3.2.1.2 Setting 19
      3.2.2 CrowdFlower study 19
         3.2.2.1 Why CrowdFlower 20
         3.2.2.2 Question and instructions 20
         3.2.2.3 Example 21
         3.2.2.4 Image 21
         3.2.2.5 Input 21
         3.2.2.6 Wage 21

4 Beta test results, analysis and conclusions 22
   4.1 Results 22
   4.2 Analysis and conclusions 22

5 CrowdFlower study results and analysis 23
   5.1 Primary results 23
   5.2 Secondary results 23
   5.3 Analysis 23

6 Discussion 25
   6.1 Image generation 25
   6.2 CrowdFlower setting 26
   6.3 Sampling method 26
   6.4 Further development 27

7 Summary and conclusions 27

A Crowd details 32
   A.1 Details and naming 32
   A.2 Post-study changes 32

B CrowdFlower submissions 33
   B.1 Format 33
   B.2 Post-study changes 33

C The LFW+at dataset 34

D CrowdFlower task example 39

List of Figures

1 Structure of the essay’s method 2
2 Structure of the proposed annotation method 2
3 Examples of images before and after background removal 13
4 Examples of images before and after head size normalisation 14
5 Example of a crowd partitioned into zones 15
6 Example of small section of a crowd permutation image 17
7 Spreadsheet used during beta test 18
8 Second LFW image of Anibal Ibarra 24

List of Tables

1 Some modern datasets ...... 6 2 Beta test completion times ...... 22 3 Small sample of beta test results ...... 22 4 Cronbach’s α of CrowdFlower submissions ...... 24 5 Submissions to Anibal_Ibarra_0002 ...... 24 1 Introduction

This essay describes the development of a method for annotating face datasets with attention labels. The annotation process consists of assigning a value to each element in a face dataset, which is a collection of images depicting faces. The values are a measure of the perceived level of attention of the depicted subject. Therefore, they are referred to in this essay as human performance attention labels. The method is demonstrated and developed by compiling attention labelling for the Labeled Faces in the Wild (LFW) [1] face dataset. The labels were gathered through crowdsourcing, which is a method of outsourcing the annotation task to large crowds of people [2]. The annotation method’s reliability was evaluated in each step of the process to identify pitfalls and factors that could reduce the method’s reliability.

The goal of developing an annotation method is to aid in solving the attention detection problem. The attention detection problem is the problem, in the field of computer vision, of visually detecting whether a person is paying attention to some given object, place or person. Compiling a labelled dataset is a crucial part in solving the attention detection problem since advances in the problem domain cannot be made without data.

1.1 Problem statement

A problem in the field of computer vision is to reliably find a suitable and sufficient dataset that fits the needs of the researcher. Correctly compiling a new dataset is a time-consuming and costly endeavour that requires careful planning and consideration. The most commonly used face datasets are often labelled with identity labels. Some include additional labels or attributes such as the angle between camera and face orientation [3, 4], distance [5, 6] and displayed emotion [7, 8], but none include human performance attention labels. This essay proposes a method for annotating existing face datasets by adding attention labelling through the use of crowdsourcing. The goal is to answer the following questions:

• Does the proposed annotation method yield statistically satisfactory results in terms of consistency and reliability?

• Which pitfalls and possible improvements can be identified by using the proposed annotation method?

These questions were answered by formulating a method based on background research and evaluating both the submissions from the crowdsourcing platform and the yielded attention labels.

1.2 Method structure

The annotation method differs from the overall method used in this essay. The essay’s method consists of two stages as illustrated in Figure 1. The first stage consists of using the proposed annotation method to compile a dataset with attention labels. The second stage involves identifying pitfalls and possible improvements from the knowledge gained from the first stage.

Figure 1: Structure of the essay’s method

The proposed annotation method consists of three stages as illustrated in Figure 2. The first stage in the proposed annotation method is to generate crowd images. Background removal and head size normalisation are performed on pseudo-randomly chosen images in the chosen face dataset. The resulting images are inserted into a virtual lecture hall. In the second stage, human intelligence is used by gathering data on the chosen crowdsourcing platform. The workers’ opinion of the crowd attendees’ level of attention is requested. In the third stage, the workers’ answers are analysed to evaluate reliability and remove frivolous entries. As a result of performing the described stages, a new dataset with attention annotation is created.

Figure 2: Structure of the proposed annotation method

1.3 Projected outcome and contribution

The developed annotation method and the compiled dataset, named Labeled Faces in the Wild+attention (LFW+at), will facilitate and enable further research on attention in behavioural sciences such as psychology and pedagogy. By annotating a dataset with labels such as attention, a new platform on which to build and perform new experiments is created. The compiled dataset could benefit interdisciplinary research by being used in machine learning algorithms to iteratively approach an accurate definition of attention. The developed method and the identified strengths and weaknesses will enable further research and exploration of the attention detection problem and related problems.

1.4 Overview

The essay mostly follows a conventional structure. Section 2 explores the relevant state of the art and theoretical background. The characteristics of several modern face datasets are examined briefly in section 2.1. Some conventional data gathering methods and crowdsourcing are described and compared in section 2.2.

In section 3, the image generation and data gathering methods used to test the proposed annotation method are described. Section 3.1 describes the image generation and the design choices made to maximise reliability of the yielded results. A data gathering test was performed prior to the main data gathering process performed by crowdsourcing. This preliminary data gathering test can be described as a beta test, as its purpose was to maximise the quality of the main data gathering process. The beta test and the main data gathering process performed on CrowdFlower are described in sections 3.2.1 and 3.2.2 respectively.

Section 4 outlines the results, analysis and conclusions from the beta test. This section combines sections that are often separate. However, the chosen structure allows a presentation of the beta test results, analysis and conclusions before presenting the corresponding aspects of the CrowdFlower study.

Section 5 describes the results of the CrowdFlower study and their analysis. In section 6, the proposed method and its phases are critically discussed. Section 7 contains a short summary and the conclusions drawn from the study. Additionally, the most important results are critically evaluated.

1.5 Terminology

Crowdsourcing - Method of obtaining a service from a large number of people.
Computer vision - Field of computer science where images are analysed and processed.
Machine learning - Scientific discipline that explores the construction and study of algorithms that can learn from data.
Face detection - The process of identifying human faces in images.
Face classification - Classification of face images into classes.
CrowdFlower - Online crowdsourcing service.
Amazon’s Mechanical Turk (MTurk) - Online crowdsourcing service.
Label - Data describing some feature of a data point, e.g. the identity of a depicted subject in an image.
Annotation - The process of labelling one or several data points.
Task/Crowdsourcing task - Task or job to be performed by workers on a crowdsourcing platform.
Workers/Crowdsourcing workers - Person or entity completing tasks.
Requester - Entity requesting the completion of a task.

2 Background

In the area of computer vision, good data is of the utmost importance. In order to accurately evaluate whether an algorithm or piece of software is working properly, it needs to be both verified and validated [1]. Without dependable data, computer vision would have no knowledge to build upon, which makes it difficult to validate and determine the ability of a piece of software. To tackle this problem of uncertainty it is common in the field of computer vision to make use of datasets, which are collections of images coupled with data representing features of the images. There are a wide variety of different types of datasets used in the field of computer vision. Sets may vary in size and content, such as depicted situations, distinguishing features or medium, depending on what the researchers require [1].

Common uses for datasets in computer vision in particular, but also in behavioural sciences, are examined in section 2.1. The methodology used to compile some modern datasets and the design choices made to fill different needs are described. Data gathering is examined in section 2.2, focusing on the state-of-the-art crowdsourcing platforms CrowdFlower and Amazon’s Mechanical Turk, but also looking at traditional data collection methods.

2.1 Face datasets

A face dataset is, most commonly, a collection of face images. However, face datasets with videos [9] and images with more abstract face representations [10] have been compiled. Datasets of faces can be used in several fields to further knowledge about human behaviour or to extract information from images. In psychological studies, test subjects’ responses to the elements in a dataset can be analysed. Through this process, relationships between responses and dataset labels or other characteristics can be inferred. In computer vision, datasets are often used to construct and compare face detection and classification systems. A list of face datasets can be found in Table 1.

2.1.1 Uses for datasets

In behavioural science fields, face datasets have been used to investigate how people perceive crowd gaze and how it guides group behaviour [10]. Face datasets have also been compiled for use in medical research. Researchers at Karolinska Institutet in Stockholm created the Karolinska Directed Emotional Faces (KDEF) [7] to facilitate and enable further medical and psychological research [11].

Face datasets are often used in the field of computer vision to train and test systems that perform face detection or several forms of face classification. Some state of the art systems use deep learning [12] to perform face detection [13] or to infer further information from faces in images [14]. Both older [15] and more recent systems use face datasets to first train the system and later to test or benchmark by comparing expected results with actual system output. A set or subset used to test the system when trained is referred to as a test set. Further, training algorithms sometimes use one or several sets to evaluate the training process [16]. Researchers behind the DeepFace project used their own Social Face Classification dataset (SFC) during training of their system, which includes a deep neural network. The LFW [1] and YouTube Faces (YTF) datasets were used to benchmark the system against human performance [14].

LFW: 13,233 images of 5,749 celebrities with identity labels [1, 14, 17].
SFC: 4.4 million images of 4,030 subjects with identity and time labels, taken over some time period [14].
YTF: 3,425 videos of 1,595 subjects [14].
KDEF: 4,900 images of 70 subjects each displaying 7 different emotional expressions, taken from 5 different angles [7].
Olivetti Faces: 400 images of 40 subjects with identity labels [18].
SFD/UMist: 564 grey scale images of 20 subjects with identity labels [19].
Multi-PIE: 750,000 images of 337 subjects from differing angles (left to right profile) taken over 5 months [20].
FiA: Video from several angles of 180 subjects [21].
JACFEE: 56 images of 56 subjects. Equal amount of males and females. Equal amount of Caucasian and Japanese. Each subject displaying one of 7 different emotions [8].
FERET: 14,051 grey scale images of people from different angles (left to right profile) with identity and some angle labels [3, 4].
mmifacedb: 2,900 frontal and profile videos and images of 75 subjects. Used for facial expression analysis [22, 23].
SCface: 4,160 images of different qualities, with uncontrolled illumination and from various distances, of 130 subjects. Used mainly to test face recognition algorithms. Labelled with identity, angle, distance and used camera [5, 6].

Table 1: Some modern datasets

2.1.2 Compilation and design of datasets

An abundance of face datasets exists, and they differ in size and highlighted parameters. Differences in settings and variability, for subjects and researchers, during compilation affect a dataset’s usefulness in different scenarios. This subsection describes several face datasets, focusing on their construction, design and purpose.

Constructed in 2007, the Labeled Faces in the Wild dataset was compiled to enable research into unconstrained face recognition. The images in this dataset have a "natural" variability in many properties of the subjects and settings, such as age, image quality and pose [1]. It is widely used and has become the standard dataset for benchmarking face verification in unconstrained settings [14].

The Surveillance Cameras Face dataset (SCface) was constructed by recording images of subjects from various angles and distances, with uncontrolled illumination, to facilitate research into face recognition under realistic conditions [5, 6].

The MMI Facial Expression Database (mmifacedb) contains images of subjects showing different emotions [23]. It was compiled as a resource for the construction of facial expression analysis algorithms [22].

In 1998, Karolinska Institutet constructed the Karolinska Directed Emotional Faces (KDEF) dataset with 4,900 images of 70 subjects displaying 7 different emotions from 5 different angles. Two pictures of the subject displaying each emotion were taken [7]. The KDEF dataset was originally created to facilitate psychological and medical research [11].

Sweeny and Whitney constructed an experiment using their own face dataset to investigate how humans perceive group gaze. The dataset consists of 16 computer generated faces, each depicted as a pair of eyes and a mouth. Faces in the set have variable orientation and gaze [10].

Depending on the initial purpose and the differences between recorded subjects and between settings, the resulting datasets are suitable for different tasks.

2.2 Data gathering

Data gathering is a difficult task with regard to aspects such as time and money. Datasets can be compiled using several data collection methods, such as crowdsourcing, web surveys and college samples. Crowdsourcing eliminates or minimises some constraints of conventional data gathering methods. In section 2.2.1, college samples and web surveys are described and compared to crowdsourcing. These methods are classed as conventional, as opposed to the state of the art, such as crowdsourcing. In section 2.2.2, the new trend of crowdsourcing is described, focusing on CrowdFlower and Amazon’s Mechanical Turk.

2.2.1 Conventional methods

2.2.1.1 College samples

College samples have been a prominent source of data for researchers, especially in the areas of psychology and social science. College samples are normally gathered in a controlled manner where the participants are a well known group of people, such as sophomores or staff [24, 25]. When conducting a study using college samples it is easy to control the environment in which the sample is collected. By regulating the manner in which the information is given to the participants it is easy to monitor reactions such as first impressions or individual expressions. Studies have shown that even without a monetary incentive, subjects of college samples are more motivated to participate in studies compared to workers at the crowdsourcing platform Mechanical Turk [24, 26].

One of the major concerns with college samples is that they are not demographically diverse [24, 25, 27]. Participants of college samples often share common denominators such as education, socioeconomic status or age. This can become problematic when looking to gauge a general response, since the participants are more inclined to share similar views. Subjects are also more likely to have prior knowledge of the subject or be familiar with the experimental practices [28].

2.2.1.2 Web surveys

To perform a web survey, also referred to as an internet sample, a researcher gathers a series of questions and distributes them on one or several online platforms to reach a desired target group [29]. Web survey participants tend to be relatively diverse, and the results need not be too negatively affected by frivolous responses or multiple responses from the same person [24, 29]. Using web surveys to collect data, researchers can access samples not accessible by using more traditional methods. This data gathering method enables researchers to gather larger sample sizes and at lower costs than more traditional techniques [29].

Typical web surveys are less representative of typical non-college populations and are relatively susceptible to coverage error. Web surveys have a higher risk of displaying heterogeneity of samples across labs and have higher non-response rates [24]. The quality of web surveys varies widely, and they suffer from subject unwillingness and lack of motivation [30]. Lack of monetary incentive is prevalent, and ensuring that the web survey setting does not violate any assumptions or preconditions made by the researcher is a difficult task. Therefore, the quality of the gathered data must be taken into consideration [24, 26, 29].

2.2.2 Crowdsourcing

Crowdsourcing is a relatively novel method for data gathering. It can be used as a way to let groups of people solve problems as a collective. Crowds are used to solve problems which are often hard for computers. This has caught the attention of scientists in fields such as social sciences and computer vision, as it provides an easy way to gain access to a scalable workforce of people from a diverse participation pool that can provide input [24, 28, 31].

An example of crowdsourcing is democratic elections. Another example is given by Surowiecki, where a typical crowd of people is used to estimate the weight of an ox by sight. To the scientist’s surprise, the crowd was collectively better at judging the ox’s weight than any individual member [32]. Another example of crowdsourcing is the NASA Clickworker project, in which volunteers could participate in mapping the surface of Mars by collectively analysing pictures taken of the surface. A total of 101,000 workers collaborated to produce an image which was later confirmed to agree with the image generated through traditional mapping techniques [33].

Many crowdsourcing platforms exist, mainly as services where requesters can publish tasks and workers can find tasks to complete in exchange for money. The most active platform is Amazon’s Mechanical Turk. However, the platform is not available to requesters from non-US countries. CrowdFlower is a crowdsourcing platform with fewer users than Mechanical Turk, and fewer tasks are published. However, CrowdFlower is available to non-US requesters and is therefore suitable as a data gathering platform. The mentioned platforms share many characteristics, but Mechanical Turk is the most studied in recent scholarly literature. Requesters on CrowdFlower can learn from the characteristics of Mechanical Turk and its workers.

2.2.2.1 CrowdFlower

CrowdFlower is an online crowdsourcing service founded in 2007 in San Francisco, California. CrowdFlower makes use of other existing crowdsourcing platforms, such as Amazon’s Mechanical Turk or SamaSource, to provide a workforce suitable for tasks [31, 34]. On CrowdFlower a user can put out a request for a specific task coupled with a price for completing said task. Workers can then choose to accept the task, complete it, and proceed to claim the reward. Further, the service provides an interface to manage the process and the performance of the workers. One of the features provided to gauge the level of accuracy is the possibility to require that the workers assigned to the task have completed a certain number of previous tasks, at the cost of a reduced worker pool [35].

Another tool is the “golden standards”, where data with the correct input is mixed in with the regular tasks; if the worker provides incorrect input they are notified and the correct answer is presented. This encourages the worker to maintain a higher level of quality, since each incorrect input will affect their accuracy score, which results in fewer opportunities to complete tasks [31].

2.2.2.2 Amazon’s Mechanical Turk

The Mechanical Turk (MTurk) program was launched by Amazon in 2005 with the intention to provide a marketplace for human-based computation tasks. The service got its name from a chess-playing machine from 1770 that was in reality an elaborate mechanical illusion, hiding the fact that it was controlled by a human chess master [36].

The validity of data gathered through MTurk has come under question due to the preconception that the quality of the data would correlate with the reward. Since the primary goal of the participants is not to partake in scientific studies but rather to, as efficiently as possible, complete the task to receive the reward, this is a reasonable concern. Studies have shown that while moderate changes in reward amount do not indicate a change in data quality, they do introduce changes in which workers are attracted [24, 28]. If the reward is lower than that of tasks with equal difficulty there will generally be fewer participants, and a larger portion of those will have previous experience with similar tasks. This could lead to the same workers gravitating towards similar studies and cause a “commons dilemma” for researchers [28].

Due to the wide diversity of the workers on MTurk there is a common misconception that there is independence across yielded results. Scientists should be wary, as it is not uncommon for workers to share their experiences with other workers on discussion boards. These boards are usually used to share information regarding lucrative tasks, which increases the likelihood of polluted data if an experiment relies heavily on a worker’s individual response [28]. Although these studies were conducted on MTurk, the results are relevant to CrowdFlower requesters.

2.2.2.3 Ethical concerns of using crowdsourcing

The increasing popularity of crowdsourced academic studies raises the question of whether paying for research is ethically sound. Many have voiced concerns that the working conditions of these workers are poor, with an estimated wage of $1.40/hour [26, 37], and have categorised it as exploitation of the less fortunate. However, a 2012 study showed that 60-70% of American workers at Amazon’s Mechanical Turk used the platform as a way to pass the time while earning money, and not as their main source of income [37].

2.3 Summary

To make advances in computer vision there is a heavy reliance on datasets. While there is a large number of visual datasets, they do not always have the right characteristics or labelling required for experiments. Making use of crowdsourcing to improve datasets has proven effective and comparable to traditional compiling methods, as long as the conditions and wage of the task are taken into consideration.

3 Method

The development of the annotation method is divided into two phases as described in section 1.2. The first phase involves using the proposed annotation method to compile an attention dataset, named LFW+at. The image generation and data gathering stages of the annotation method are described in this section.

Image generation involved generating crowds consisting of images from the LFW dataset. The resulting crowd images were used in the data gathering process. The approach and reasoning are further explored in subsection 3.1. Subsection 3.2 expands on the data collection and compilation stage, which includes both a beta test and a main study on the crowdsourcing platform CrowdFlower. The beta test included testing of the generated images, which were later used in the study conducted on the CrowdFlower platform.

3.1 Generating images

In order to provide a consistent way of labelling attention through human intelligence, the evaluated item needs to be presented in a context such that it can easily be understood by the person assigned to the task. This section describes how the images used in the study were generated.

Images of crowds were generated from images in the LFW dataset. Images from the LFW dataset were inserted into a realistic and recognisable environment. The chosen setting was a virtual lecture hall with six rows and eight seats per row. The lecture hall was filled using simple pseudo-random sampling without replacement on the LFW dataset. Sampled images were segmented to remove their background. Basic face size normalisation was performed and the images were inserted into a background image of a virtual lecture hall. Design choices were made to facilitate the work for workers at the chosen crowdsourcing platform and to pass on defining attributes of the LFW set to the LFW+at set. In total, 42 images of 7 distinct crowds were generated. Information about the images and the crowds can be found in Appendix A.

Several factors that led to the choice of the LFW face dataset are outlined in subsection 3.1.1. The design and variability of the dataset strongly contributed to this choice. In subsection 3.1.2 the pseudo-random sampling method of selecting crowd members from the LFW dataset is described. A description of the process of background removal, also referred to as foreground extraction, can be found in subsection 3.1.3. Subsection 3.1.4 briefly explains the need for head size normalisation and the method used. Additionally, the subsection describes the insertion of individual face images into the virtual lecture hall. Additional design choices are explained in subsection 3.1.5. These include seat labelling and the need for creating several permutations of each crowd.

3.1.1 Why LFW

LFW consists of over 13,000 pictures (250x250 px) of celebrity faces and was used because it includes faces in natural poses from a variety of angles. The dataset possesses a natural variability regarding parameters such as age, colour, pose, gaze and quality [1]. Crowds are typically only constrained by the position of the seats, and sometimes not even that. For the resulting dataset to be usable in attention detection systems operating on practically unconstrained crowds, the original dataset (LFW) must display sufficient variability. Datasets with, for example, only frontal faces or with the same illumination across all images could not be used.

Another reason for choosing the LFW face dataset to compile the attention dataset is the practical design of the LFW dataset. Images in the LFW dataset are outputs from the Viola-Jones algorithm for face detection [1]. Using the LFW dataset enables a real attention detection system to use the Viola-Jones face detection algorithm as a first step in detecting attention.

Figure 3: Examples of images before and after background removal. (a) Mary_Tyler_Moore_0001 (b) Pete_Sampras_0011 (c) Choi_Sung-hong_0001

3.1.2 Method of choosing members

The attendee or image at each position of a configuration with 6 rows and 8 seats per row was chosen pseudo-randomly. This was implemented using the pseudo-random number generator class SystemRandom from the Python module random, running on both Ubuntu and Windows systems (it uses the function os.urandom() on UNIX-like systems and CryptGenRandom() on Windows systems). The generated crowd, represented by a 6x8 matrix, was saved both in human-readable form and as a binary file generated with the save function in the Python module NumPy. Each position in the matrix represents one of the seats in the virtual lecture hall and its occupant. Contents of such a file are retrievable by a Python application.
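The sketch below illustrates this sampling and saving step. It is not the authors' script: the directory name, the output file names and the placement of the 0.3 empty-seat probability (introduced in section 3.1.5) are assumptions.

import os
import random

import numpy as np

ROWS, SEATS = 6, 8
rng = random.SystemRandom()  # backed by os.urandom(), as described above

# Assumed directory containing the LFW images, e.g. "Tony_Blair_0080.jpg".
lfw_images = sorted(os.listdir("lfw_all_images"))

# Fill the 48 positions without replacement; leave a seat empty with
# probability 0.3 (the value used for the generated crowds, see section 3.1.5).
crowd = []
for _ in range(ROWS * SEATS):
    if rng.random() < 0.3:
        crowd.append(None)
    else:
        pick = rng.choice(lfw_images)
        lfw_images.remove(pick)   # sampling without replacement
        crowd.append(pick)

matrix = np.array(crowd, dtype=object).reshape(ROWS, SEATS)

np.save("alpha_crowd.npy", matrix)          # binary form, retrievable with np.load
with open("alpha_crowd.txt", "w") as f:     # human-readable form
    for row in matrix:
        f.write(",".join(str(seat) for seat in row) + "\n")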

3.1.3 Background removal

The backgrounds of all images in a given crowd configuration were removed using a modified version of the interactive image segmentation tool GrabCut [38]. The tool was modified to control more parameters, facilitate background removal from very dark images and to modify output. The LFW dataset consists of 250x250 px images in JPEG format. The backgrounds of the sampled images from this dataset were removed. The resulting images were saved as PNG files, as the format offers support for an alpha channel and uses a lossless compression algorithm. During this process certain decisions had to be made regarding exclusion and other editing. Objects such as microphones or tennis rackets were removed if they did not significantly cover the subject’s face or body, were partially outside the image, or both. In Figure 3a, a microphone was removed as it only covered the fingers of the subject. Shoulders were, when necessary and possible, smoothed to facilitate future size normalisation of faces and to avoid unnatural sharp edges in the output. Examples where this has been done can be found in Figures 3a and 3b. Faces heavily occluded by relatively large, partially depicted objects were extracted by removing foreground items. An example of this can be found in Figure 3c.

Figure 4: Examples of images before and after head size normalisation. From left to right: model attendee Maria_Soledad_Alvear_Valenzuela_0002, John_Bolton_0016 and Michael_Patrick_King_0002
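A minimal, non-interactive sketch of GrabCut-based background removal with OpenCV is shown below. The essay used a modified interactive tool, so the initial rectangle, the iteration count and the file names here are assumptions.

import cv2
import numpy as np

img = cv2.imread("Pete_Sampras_0011.jpg")            # 250x250 px LFW image
mask = np.zeros(img.shape[:2], np.uint8)
bgd_model = np.zeros((1, 65), np.float64)
fgd_model = np.zeros((1, 65), np.float64)
rect = (25, 25, 200, 200)                            # rough box around the subject

cv2.grabCut(img, mask, rect, bgd_model, fgd_model, 5, cv2.GC_INIT_WITH_RECT)

# Keep (probable) foreground pixels opaque and make everything else transparent.
alpha = np.where((mask == cv2.GC_FGD) | (mask == cv2.GC_PR_FGD), 255, 0).astype(np.uint8)
bgra = cv2.cvtColor(img, cv2.COLOR_BGR2BGRA)
bgra[:, :, 3] = alpha
cv2.imwrite("Pete_Sampras_0011.png", bgra)           # PNG preserves the alpha channel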

3.1.4 Actual image generation

After background removal, rudimentary face size normalisation was performed using the image processing software GIMP. The most extreme outliers in a given crowd were identified visually and resized after a model attendee. The distance between eyes and mouth on the outlier was modified to become closer to that of the model attendee. Normalisation was performed to reduce distractions caused by extreme differences in attendee face size. This was done to facilitate the task for CrowdFlower workers. Examples are shown in Figure 4. The images were then positioned on top of a 2000x820 px background image consisting of labelled seats. Each consecutive row following the first one was scaled by a factor of 0.85. This value was chosen experimentally to give the image a sense of depth while later rows remain discernible.

Figure 5: Example of a crowd partitioned into zones. (a) alpha_permutation_3 partitioned into zones (b) alpha_permutation_2 partitioned into zones
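A sketch of how the compositing step could be scripted is given below. The essay does not describe an automated compositing script, so this is illustrative only: the seat coordinates, file names and the use of Pillow are assumptions, while the 2000x820 px background and the 0.85 per-row scale factor come from the text above.

from PIL import Image

ROW_SCALE = 0.85

hall = Image.open("lecture_hall_2000x820.png").convert("RGBA")  # assumed background file

def paste_attendee(face_path, depth, top_left):
    # depth 0 = frontmost row; rows further back shrink by 0.85 per row
    # (one reading of the scaling described above).
    face = Image.open(face_path).convert("RGBA")     # background-removed PNG
    scale = ROW_SCALE ** depth
    size = (int(face.width * scale), int(face.height * scale))
    face = face.resize(size, Image.LANCZOS)
    hall.alpha_composite(face, dest=top_left)

# Hypothetical usage; a real script would map (row, seat) to pixel coordinates.
paste_attendee("Tony_Blair_0080.png", depth=2, top_left=(640, 310))
hall.save("alpha_permutation_3.png")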

3.1.5 Design details

Crowd attendees were chosen pseudo-randomly from the entire LFW dataset to avoid personal bias. To preserve the natural variability of LFW in our dataset, several distinct crowds were generated. To make the images more realistic and to catch inattentive workers, positions in the crowd are left unoccupied with probability 0.3. Each image therefore contains around 34 attendees occupying as many seats in the 48-seat lecture hall.

Seats were labelled to enable workers to enter data. To simplify the input process for participants in the various data gathering methods, seats in the image are labelled to match a standard spreadsheet (e.g. MS Excel and Google Drive’s spreadsheet). This promotes association and takes advantage of workers’ familiarity with widely used applications. Each seat in the virtual lecture hall can be identified by two characters, one number and one letter. Each row in the virtual lecture hall was numbered, starting from the topmost, as row 1, and ending at the bottommost, as row 6. Each seat in any given row is assigned a letter A-H in alphabetical order, from left to right. In the image, the label, e.g. 4F, was positioned in the top left corner of each seat. Label positioning, size and colour were considered to maximise visual information and prevent distractions caused by excessive clutter. One seat label in the top left corner was used rather than minimising the possibility of seat label occlusion by showing a duplicate in the top right corner. When possible, faces of attendees were translated to the right within their images, to increase the probability of not occluding the seat label while minimising clutter. Seat label size was chosen experimentally to remain discernible in row 1. A desaturated yellow colour was chosen to minimise distractions while maximising visibility. Visibility is maximised, as yellow and blue are complementary colours, thus creating strong contrast [39].

Perception of a crowd member’s level of attention could be dependent on its position in the crowd. To counter the effect position has on perceived attention, several permutations of each crowd are generated. To do this, the crowd is partitioned into six zones. Every member appears only once in each zone, across the six permutations. Furthermore, each crowd member appears exactly once on each row and on each column across all permutations. See Figures 5a and 5b for examples.
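A tiny sketch of the seat-labelling scheme described above follows; the variable names are placeholders, but the row numbering (1-6, top to bottom) and seat lettering (A-H, left to right) come from the text.

ROWS, SEATS = 6, 8

# Build the 6x8 grid of labels: "1A" ... "6H".
seat_labels = [
    [str(row) + chr(ord('A') + seat) for seat in range(SEATS)]
    for row in range(1, ROWS + 1)
]

print(seat_labels[3][5])   # row 4, sixth seat from the left -> "4F", the example label above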

3.1.6 Generated files

In total, 7 distinct crowds were generated, each with 6 permutations. The crowds were given the names alpha, beta, gamma, delta, epsilon, zeta and eta. Information about all generated files can be found in Appendix A. An example of a human-readable crowd permutation definition file is shown below. An example of a small section of a generated crowd permutation image is shown in Figure 6.

Extract from a human-readable textual representation of crowd permutation alpha_permutation_3 (first three seats of each row):

None,Bill_Frist_0009.jpg,Alain_Cervantes_0001.jpg,Dick_Cheney_0007.jpg
Mariah_Carey_0007.jpg,Jason_Kidd_0006.jpg,None
Tony_Blair_0080.jpg,Recep_Tayyip_Erdogan_0015.jpg,George_W_Bush_0101.jpg
None,Michael_Bloomberg_0005.jpg,Pete_Sampras_0011.jpg
None,John_Bolton_0016.jpg,None
Gerhard_Schroeder_0009.jpg,Pedro_Malan_0004.jpg,None

Figure 6: Small section of the generated image of crowd permutation alpha_permutation_3

3.1.7 Software used

• Python 2.7.9
• OpenCV 2.4.9
• NumPy 1.8.2
• modified GrabCut
• GIMP 2.8.10

3.2 Collecting data

Data was collected through a beta test and a study, or survey, on the crowdsourcing platform CrowdFlower. The beta test was used to maximise the reliability of the CrowdFlower study. Given a simple pseudo-random sampling method, the sampled images might not be representative of the original dataset. This problem is discussed in section 6.3. The biggest concern regarding data quality is workers who are frivolous and inattentive. Workers who have more knowledge about attention than assumed cannot affect the study negatively.

Figure 7: Spreadsheet used during beta test

3.2.1 Beta test

To determine wage and task specification for workers, time and attention labels were collected from three beta testers completing the task of annotating the seven distinct crowds listed in section 3.1.6 and Appendix A. These crowds, six permutations of each, and their images were generated by the method described in section 3.1. Each tester was given the image of one permutation of each crowd and was presented with a task specification. The given crowd permutation was chosen randomly without replacement. This was done to remove positional bias in the gathered data. Although the data is secondary to this stage, it was recorded and analysed.

3.2.1.1 Question

A professor is holding a lecture in the lecture hall shown in the picture. Examine each person in the crowd and determine how much attention they are paying to the lecture.

1. Unsure
2. Definitely not paying attention/Empty
3. Probably not paying attention
4. Probably paying attention
5. Definitely paying attention

3.2.1.2 Setting

Each tester was presented with a spreadsheet, like the one shown in Figure 7, where the cells to be used were highlighted. The background colour was chosen to resemble the seat colour in the virtual lecture hall to promote association. The highlighted cells were the rectangular area from A1 to H6, which is a 6x8 cell area, with rows 1-6 and columns A-H. This corresponds to the virtual lecture hall’s 6 rows and 8 seats per row, with rows 1-6 and seats A-H, which also promotes association. The tester was prompted to read the question and presented with the first assigned image. For each crowd, the tester was presented with a new spreadsheet and image.

3.2.2 CrowdFlower study

The CrowdFlower study was created by combining the CrowdFlower Markup Language (CML) with JavaScript. This enabled randomisation of the crowd permutation image presented to the worker. Additionally, information about which crowd permutation the worker had been presented with was added as a hidden field in the input form. It was not possible to implement a gold standard for the crowd members, as their level of attention was unknown. Furthermore, the gold standard data would have to be changed according to the image presented to the worker, and the CrowdFlower platform does not support a way to make this change during the study.

The task was given the title Determine attention level of crowd members. In addition to the short instructions given for the beta test, a more detailed description of the task was provided. An example of input was provided to minimise worker misunderstandings and facilitate the worker’s task. Further, the arbitrary scale used in the beta test was replaced by a scale that closely follows guidelines for Likert scales. The instructions and further information given to the worker are outlined in the subsections below. An image of the task setting can be found in Appendix D.

3.2.2.1 Why CrowdFlower

There are three main reasons why CrowdFlower was chosen over other similar crowdsourcing services: CrowdFlower makes use of other vendors, is easy to use and has few requester restrictions. Since the CrowdFlower platform not only uses its own workforce but also outsources work to other crowdsourcing services such as Amazon’s Mechanical Turk and SamaSource, it was an ideal fit for this study. The developed method should therefore hold valid on other crowdsourcing platforms, not only CrowdFlower. The platform was easy to set up and customise, since the provided templates could be altered with CML and JavaScript. This meant additional control over factors such as image randomisation, validation and what data to pass along with the submission. To prevent fraud, some vendors have added requirements for requesters to provide some form of verification (social security number, company ID) before being allowed access to the service. CrowdFlower did not require such additional information and was therefore easier to use.

3.2.2.2 Question and instructions

A professor is holding a lecture in the depicted lecture hall. Examine each person in the crowd and determine how much attention they are paying to the lecture. Rows in the lecture hall are numbered 1-6 (top to bottom). Seats in a row are labelled A-H (left to right). It has 48 seats and around 35 people. Enter 1 value in each cell of the spreadsheet below according to:

1. Empty/Not paying attention
2. Probably not paying attention
3. Uncertain
4. Partially paying attention
5. Paying attention

Completing this survey takes 6-12 minutes

3.2.2.3 Example

If you enter values 1,2,3,3,4,4,5,5 into the first row of the spreadsheet, it will be interpreted as:

• The seat labelled 1A is either empty or the person in that seat is not paying attention to the lecture.
• The seat labelled 1B is not empty and the person in that seat is probably not paying attention to the lecture.
• The seats labelled 1C and 1D are not empty, but you are unsure of the level of attention of the persons sitting there.
• The seats labelled 1E and 1F are not empty and the persons in those seats are partially/probably paying attention to the lecture.
• The seats labelled 1G and 1H are not empty and the persons in those seats are definitely paying attention to the lecture.

3.2.2.4 Image

Seven crowds with six permutations each were used during the study. These are the same crowds used during the beta test; however, all permutations were used during the CrowdFlower study. The crowds are listed in section 3.1.6 and additional information can be found in Appendix A. A given worker was assigned one out of forty-two images. The image assigned and shown to a given worker was chosen pseudo-randomly using JavaScript’s Math.random() method.

3.2.2.5 Input

An array of text inputs was arranged in a matrix where each input corresponds to one seat in the generated image. To avoid user errors, each field required an integer between 1 and 5 for the submission to be accepted.

3.2.2.6 Wage

Using the data gained through the beta test, the estimated time frame for the survey was 12 minutes. The wage was set to 0.1 USD per task by comparing this time frame with other academic tasks of similar estimated length.

4 Beta test results, analysis and conclusions

4.1 Results

The time taken to complete the labelling of each crowd was recorded and is listed in Table 2.

           Tester 1  Tester 2  Tester 3  Total  Mean  SD
1st crowd  9:44      6:36      10:02     26:22  8:47  1:33
2nd crowd  7:12      2:51      6:54      16:57  5:39  1:59
3rd crowd  4:28      2:18      5:27      12:13  4:04  1:19
4th crowd  2:16      2:20      5:04      9:40   3:13  1:18
5th crowd  2:03      1:45      4:13      8:01   2:40  1:06
6th crowd  1:53      1:51      3:28      7:12   2:24  0:45
7th crowd  1:58      1:51      3:25      7:14   2:25  0:43
Total      29:34     19:37     38:33     87:44

Table 2: Beta test completion times

An extract from beta_test_results.csv showing image filename, mean and standard deviation, in that order (None refers to all empty seats) can be found in Table 3.

None                        2.000000  0.000000
Al_Gore_0001                2.000000  0.000000
Alain_Cervantes_0001        2.333333  0.471405
Alejandro_Atchugarry_0002   4.333333  0.471405
Andrew_Firestone_0001.jpg   2.666667  0.471405
Angela_Bassett_0005         4.333333  0.471405
Anibal_Ibarra_0002          4.333333  0.471405
Anthony_Garotinho_0001      3.000000  0.000000
Anzori_Kikalishvili_0001    4.333333  0.471405
Aretha_Franklin_0001        4.000000  0.000000

Table 3: Small sample of beta test results

4.2 Analysis and conclusions

Beta testers completed their first annotation task in a mean time of 8:47. The beta testers are not a random sample and are not drawn from the population about which inferences are to be made. However, a simple analysis should give an indication of the time it could take workers at CrowdFlower to complete the annotation task. Given our limited tests, and assuming a normal distribution, there is a 95% probability that workers will finish the beta test task, when seen for the first time, before 11 minutes and 49 seconds. This number corresponds to the upper bound of a 95% confidence interval on a normal distribution with mean 8:47 and standard deviation 1:33.

After discussions and feedback from Lecturer Richard Glassey of the TCS school of KTH Royal Institute of Technology, the arbitrary scale used during beta testing was deemed to be sub-optimal. A Likert scale was decided upon for the CrowdFlower study.
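The 11 minutes and 49 seconds figure quoted above can be checked with a few lines of arithmetic; this is a verification sketch, not part of the original analysis.

mean_s = 8 * 60 + 47              # 8:47 in seconds
sd_s = 1 * 60 + 33                # 1:33 in seconds
upper_s = mean_s + 1.96 * sd_s    # 1.96 = z-value for a two-sided 95% interval

print(divmod(int(round(upper_s)), 60))   # -> (11, 49), i.e. 11 minutes 49 seconds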

5 CrowdFlower study results and analysis

5.1 Primary results

The data from CrowdFlower, consisting of 9600 inputs, cannot be listed here but is available as a CSV file. Information about the CSV file’s formatting and a URL where it can be found are included in Appendix B. The LFW+at dataset of face image and attention value pairs can be found in Appendix C. In the third and fourth columns of the table, reliability measurements have been included.

5.2 Secondary results

CrowdFlower offers its workers the option to leave feedback after a task is completed. This poll is not mandatory. 47 workers answered the poll, which yielded the following overall scores:

Instructions clear: 4.5 / 5
Test questions fair: 4.4 / 5
Ease of job: 4.4 / 5
Pay: 4.4 / 5

5.3 Analysis

Two hundred tasks were published on CrowdFlower, resulting in 200 submissions with 9600 worker inputs in total. The total number of distinct face images was 228. First, one of the 200 submitted tasks was removed because the permutation name was missing; the submitted results could not be coupled with the corresponding crowd image. As mentioned in section 3.2.2, golden standards could not be used during the study. For this reason, answers from inattentive users were removed at this stage. A total of 44 entries were removed because the worker was blatantly inattentive. The entire entry was removed when the worker assigned a value larger than 1 to an empty seat. This left a total of 155 submissions at this stage.

The five category Likert scale data was then collapsed into three categories. Categories 1 and 2 are negative categories where the worker has expressed that the depicted subject is not paying attention. Category 3 corresponds to the worker being uncertain of the level of attention of the depicted subject. Categories 4 and 5 are positive categories where the worker has expressed that the depicted subject is paying attention. Categories 1 and 2 were combined, and so were categories 4 and 5, leaving three distinct categories. The resulting set of categories for each Likert item was:

1. Empty/Not paying attention
2. Uncertain
3. Paying attention

The data from a single position in a crowd is ordinal, as it comes from a Likert scale with ranked items rather than absolute values or ratios. Therefore, no assumptions about the distance between the categories can be made. The Cronbach’s α of all submissions to a particular crowd was calculated to estimate the lower bound of the reliability of the submissions. Cronbach’s α is a measure of reliability and internal consistency of the scale. A test with α > 0.9 is said to be good and α > 0.7 to be acceptable [40]. The results are listed in Table 4. Information about the crowds used during this study can be found in Appendix A.

Crowd name   Cronbach’s α   Rows
alpha        0.856          22 of 29
beta         0.843          21 of 26
gamma        0.901          19 of 27
delta        0.757          27 of 35
epsilon      0.933          22 of 26
zeta         0.891          24 of 29
eta          0.836          21 of 27

Table 4: Cronbach’s α of CrowdFlower submissions
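For reference, Cronbach’s α can be computed as below. This is a generic sketch rather than the authors' script; rows are assumed to be worker submissions to one crowd, columns the occupied seats, and the toy data is invented.

import numpy as np

def cronbach_alpha(scores):
    # scores: 2-D array, shape (n_submissions, n_items), e.g. collapsed 1-3 ratings.
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1)        # variance of each item (seat)
    total_var = scores.sum(axis=1).var(ddof=1)    # variance of each submission's total
    return (k / (k - 1.0)) * (1.0 - item_vars.sum() / total_var)

# Toy example: 4 submissions rating 3 seats.
print(cronbach_alpha([[3, 3, 2], [3, 2, 2], [2, 2, 1], [3, 3, 3]]))   # ~0.86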

24 133133223333333313332312

Table 5: Submissions to Anibal_Ibarra_0002

Figure 8: Second LFW image of Anibal Ibarra to the ordinal nature of the data, the frequency of answers which corre- sponded with the mode was used to measure the reliability and agreement of the attention label. In Table 5, the submissions to the second image of Anibal Ibarra in the LFW dataset are shown. The face image can be seen in Figure 8. The resulting mode was 3 and the agreement for this value was 66.7%. The label of this image is therefore 3, which corresponds to the depicted subject paying attention. Furthermore, the interquartile range of the collective submissions to each image was calculated as a measure of the spread or dispersion of the answers.
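The label, agreement and interquartile range for this example can be reproduced from the submissions in Table 5; the snippet below is an illustrative check, not the authors' code.

from collections import Counter
import numpy as np

submissions = [int(c) for c in "1 3 3 1 3 3 2 2 3 3 3 3 3 3 3 3 1 3 3 3 2 3 1 2".split()]

mode, freq = Counter(submissions).most_common(1)[0]
agreement = 100.0 * freq / len(submissions)
q1, q3 = np.percentile(submissions, [25, 75])

print("mode=%d agreement=%.1f%% IQR=%.1f" % (mode, agreement, q3 - q1))
# -> mode=3 agreement=66.7% IQR=1.0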

6 Discussion

6.1 Image generation

Exclusively using the GrabCut tool to remove backgrounds was time-consuming and alternative solutions should be pursued. Using face detection to automatically provide input to the GrabCut algorithm works relatively well if the researcher is only interested in the faces. When extracting "busts" from images there is no great advantage to this method over using GrabCut, since the face detection algorithm has seemingly poor performance when the body is rotated or obscured by objects in the foreground. However, only extracting faces from the images in the face dataset might be a valid strategy that could yield good results during data gathering.

Because of the wide variety of faces presented in the LFW dataset, some needed to be adjusted in size. Few images depicted a subject with a significantly different head size compared to the chosen model attendee. Therefore, head size normalisation was performed manually. This normalisation process could have been done automatically and might be necessary when compiling larger subsets.

6.2 CrowdFlower setting

In order to make the task as comprehensible and intuitive as possible, it had to convey the purpose and goal of the survey clearly to avoid user error. Even so, 22% of submissions were discarded due to incorrect labelling of an empty seat. Discarding such a large percentage of submissions is a severe problem, as it is a waste of money for the researcher. However, submissions that display such a high degree of worker inattentiveness should not be used. It is difficult to isolate the cause of the failures, as there are few ways to determine whether a survey submission is incorrect when it is subject to interpretation. However, it is clear that further study into this phenomenon is warranted.

An issue with having a random function decide which image to show to the worker is that every crowd permutation might not be sufficiently represented. While the pseudo-random function should approach an even distribution as the number of tasks increases, this does not carry any guarantee that each permutation has the same number of evaluations. A sequential rotation of permutations would solve this problem, but has limitations in terms of implementation on the CrowdFlower platform.

A solution that eliminates both problems mentioned above is to post a new task for each crowd permutation to be evaluated. This enables the researcher to supply gold standard data defined to discard submissions with incorrectly labelled empty seats. Such submissions are then eliminated by CrowdFlower, receive no payment and are not recorded in the results. Additionally, this method of posting tasks enables the researcher to control the number of evaluations of each crowd permutation. The researcher can request the same number of evaluations of every crowd permutation.

6.3 Sampling method

Members of a crowd were chosen using simple random sampling. This sampling method might give samples that are not representative of the population at large. A different approach would be stratified sampling, which is not vulnerable to sampling errors due to randomness, but could not be used in this case. The images would be classified by parameters that affect attention and then sampled. This would allow us to easily draw inferences about the classes. Unfortunately these parameters are unknown. However, parameters that might strongly affect the perceived level of attention of a depicted subject are head rotation and gaze.

6.4 Further development

Future work in the area of attention detection could revolve around further improving the annotation method. Another interesting study would be to extract attributes from the images in the LFW+at dataset. These attributes could be head rotation, gaze or any other characteristic of the image. A learning system could be trained with the extracted attributes as input and the attention labels as the expected result. The trained system might then be able to adequately classify unseen images.
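Purely as an illustration of this follow-up idea, the sketch below trains a classifier on hypothetical extracted attributes against LFW+at labels. The feature files, the choice of attributes and the use of scikit-learn are assumptions, not part of the essay.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Hypothetical inputs: one row per LFW+at image with columns such as
# [head_yaw, head_pitch, gaze_x, gaze_y]; labels are the 1-3 attention values.
X = np.load("lfw_at_features.npy")
y = np.load("lfw_at_labels.npy")

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
print("held-out accuracy:", clf.score(X_test, y_test))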

7 Summary and conclusions

The method presented in this essay yielded statistically promising results for compiling a dataset with human performance attention labelling. The submissions gathered from CrowdFlower were deemed reliable by calculating their Cronbach’s α. None of the crowds had an α lower than 0.7; most crowds instead approached α = 0.9, which is a good result. Images in the LFW dataset were annotated with attention labels according to the most common answers received from CrowdFlower submissions. Instead of removing images with a low agreement rating from the LFW+at dataset, two additional labels were included for each image. These labels are the mode frequency and the interquartile range. Both values should be taken into consideration by users of the LFW+at dataset.

The proposed annotation method did contain practical flaws. To achieve the needed level of accuracy, the image generation algorithms used in the creation of the crowds required too much manual input. For this method to be useful on large datasets, more effort must be directed towards reliably automating the image generation process. Additionally, when using CrowdFlower as the main platform for the data gathering process, the permutations of all generated crowds should be included in only one task each. Gold standard data should be supplied for each task. A stratified sampling method should be considered if attributes affecting the perceived, or actual, level of attention of depicted subjects can be deduced or inferred. The method described in this essay, with the improvements mentioned above, can be used to reliably annotate face datasets with human performance attention labels.


A Crowd details

All crowd definition files and images are available at: http://www.csc.kth.se/~romuld/kex_crowds.zip.

A.1 Details and naming

For the beta test and CrowdFlower study, 7 distinct crowds were generated by simple pseudo-random sampling without replacement. The crowds were named alpha, beta, gamma, delta, epsilon, zeta and eta. 6 images of each crowd were created, each representing one permutation of the crowd as described in section 3.1.5. This resulted in a total of 42 images. Each permutation follows the naming format <crowd name>_permutation_<permutation number>, where the permutation number is an integer 0-5. As an example, one permutation of crowd alpha was named alpha_permutation_0.

A.2 Post-study changes

Three crowd names have been changed to follow the same naming scheme as the other four crowds.

B CrowdFlower submissions

All submissions from CrowdFlower are available at: http://www.csc.kth.se/~romuld/kex_submissions.csv.

B.1 Format

The file contains one row for each submission from CrowdFlower. The first row defines the label of each column. The columns labelled 1a, 1b, 1c, ..., 6h, which have 0-based indices 10 to 57, contain the workers' evaluations of the crowd permutation image named in the column labelled permutation_name, which has 0-based index 58.
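A minimal sketch of reading this file with the column positions given above; it assumes the CSV has been downloaded locally and simply groups the answers by crowd permutation and seat label.

    import csv
    from collections import defaultdict

    def load_answers(path="kex_submissions.csv"):
        # (permutation_name, seat_label) -> list of answers from all workers
        answers = defaultdict(list)
        with open(path, newline="") as f:
            reader = csv.reader(f)
            header = next(reader)              # first row: column labels
            seat_columns = range(10, 58)       # columns labelled 1a ... 6h
            for row in reader:
                permutation = row[58]          # column 'permutation_name'
                for col in seat_columns:
                    answers[(permutation, header[col])].append(row[col])
        return answers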

B.2 Post-study changes

Worker IDs and IP addresses have been excluded. Three crowd names have been changed to follow the same naming scheme as the other four crowds.

C The LFW+at dataset

The attention label (Mode) of each image takes one of three values:

• Mode 1: Not paying attention
• Mode 2: Uncertain
• Mode 3: Paying attention

No. Image name Mode Mode frequency IQR¹
1 Al_Gore_0001 1 1.0000000000 0.00
2 Alain_Cervantes_0001 1 0.7727272727 0.00
3 Alejandro_Atchugarry_0002 3 0.4444444444 1.00
4 Andrew_Firestone_0001 3 0.4090909091 1.75
5 Angela_Bassett_0005 3 0.5185185185 1.00
6 Anibal_Ibarra_0002 3 0.6666666667 1.00
7 Anthony_Garotinho_0001 1 0.4761904762 2.00
8 Anzori_Kikalishvili_0001 3 0.7777777778 0.00
9 Aretha_Franklin_0001 3 0.4736842105 2.00
10 Ariel_Sharon_0041 3 0.4761904762 1.00
11 Ariel_Sharon_0077 3 1.0000000000 0.00
12 Atal_Bihari_Vajpayee_0017 3 0.7777777778 0.00
13 Bernard_Ebbers_0001 3 0.9047619048 0.00
14 Bill_Frist_0009 3 0.6363636364 1.00
15 Bill_Nelson_0001 3 0.9523809524 0.00
16 Binyamin_Ben-Eliezer_0002 1 0.5238095238 2.00
17 Binyamin_Ben-Eliezer_0006 1 0.4761904762 2.00
18 Bob_Huggins_0004 3 0.4444444444 2.00
19 Brad_Gushue_0001 3 0.4090909091 2.00
20 Britney_Spears_0012 1 0.7142857143 1.00
21 Bryan_Chui_0001 1 0.4736842105 2.00
22 Bulent_Ecevit_0002 3 0.8947368421 0.00
23 Carlo_Azeglio_Ciampi_0001 3 0.8750000000 0.00
24 Carlos_Ghosn_0002 3 0.6842105263 1.00
25 Carlos_Moya_0007 1 0.4444444444 2.00
26 Carlos_Moya_0019 3 0.9047619048 0.00
27 Carlton_Dotson_0001 3 0.8571428571 0.00
28 Chen_Shui-bian_0002 3 0.5185185185 2.00
29 Choi_Sung-hong_0001 3 0.7727272727 0.00
30 Chris_Cirino_0001 3 0.8095238095 0.00
31 Chris_Thomas_0001 3 0.4736842105 2.00
32 Christian_Bale_0001 1 0.4736842105 2.00
33 Chuck_Woolery_0001 3 0.7727272727 0.00
34 Colin_Cowie_0001 3 0.7894736842 0.00
35 Colin_Powell_0016 3 0.7407407407 0.50
36 Colin_Powell_0061 3 0.8421052632 0.00
37 Colin_Powell_0085 3 0.8095238095 0.00
38 Colin_Powell_0098 3 0.8095238095 0.00
39 Colin_Powell_0141 3 1.0000000000 0.00
40 Colin_Powell_0230 3 0.7083333333 1.00
41 Cori_Enghusen_0001 1 0.4761904762 2.00
42 Dariusz_Michalczewski_0001 1 0.5238095238 1.00
43 Daryl_Hannah_0002 3 0.5555555556 1.00
44 David_Bell_0001 3 0.5789473684 2.00

45 Dean_Barker_0001 3 0.5238095238 2.00
46 Derek_King_0001 3 0.4090909091 2.00
47 Dick_Cheney_0007 3 0.4545454545 1.75
48 Donald_Rumsfeld_0059 3 0.8095238095 0.00
49 Edward_Arsenault_0001 3 0.5238095238 2.00
50 Edwina_Currie_0003 3 0.8095238095 0.00
51 Emma_Watson_0001 3 0.9523809524 0.00
52 Evo_Morales_0001 3 0.5416666667 1.00
53 Faye_Dunaway_0003 3 0.5000000000 2.00
54 Floyd_Mayweather_0001 3 0.4545454545 2.00
55 Francois_Botha_0001 3 0.5909090909 1.00
56 Frank_Solich_0005 3 0.7727272727 0.00
57 Franz_Fischler_0004 3 0.5454545455 1.00
58 Fred_Thompson_0002 3 0.8181818182 0.00
59 Gabriel_Valdes_0002 3 0.4814814815 2.00
60 George_Allen_0001 1 0.5454545455 2.00
61 George_Clooney_0002 1 0.9473684211 0.00
62 George_HW_Bush_0005 3 0.4761904762 2.00
63 George_Pataki_0003 3 0.6363636364 1.00
64 George_W_Bush_0003 3 0.4545454545 2.00
65 George_W_Bush_0004 1 0.5925925926 2.00
66 George_W_Bush_0101 3 0.8181818182 0.00
67 George_W_Bush_0308 3 0.7142857143 1.00
68 George_W_Bush_0329 3 0.7916666667 0.00
69 George_W_Bush_0390 1 0.5238095238 2.00
70 George_W_Bush_0395 3 0.6818181818 2.00
71 George_W_Bush_0408 3 0.9523809524 0.00
72 George_W_Bush_0486 3 0.5000000000 2.00
73 Georgina_Bardach_0001 3 0.7037037037 1.00
74 Gerhard_Schroeder_0009 3 0.9090909091 0.00
75 Gerhard_Schroeder_0047 3 0.8571428571 0.00
76 Glen_Sather_0001 3 0.6842105263 1.00
77 Guillermo_Coria_0023 1 0.6296296296 2.00
78 Gunter_Pleuger_0006 3 0.8571428571 0.00
79 Gustavo_Kuerten_0003 3 0.5000000000 1.00
80 Guus_Hiddink_0001 3 0.5416666667 1.00
81 Habib_Rizieq_0003 1 0.4210526316 2.00
82 Hal_McCoy_0001 3 0.6818181818 1.00
83 Han_Sung_Joo_0001 3 0.5000000000 2.00
84 Harrison_Ford_0010 3 0.9047619048 0.00
85 Heidi_Klum_0003 1 0.5000000000 2.00
86 Hernan_Diaz_0001 1 0.9523809524 0.00
87 Hugo_Chavez_0031 1 0.5454545455 2.00
88 Humberto_Coelho_0001 3 0.8888888889 0.00
89 Igor_Ivanov_0006 3 0.7272727273 0.75
90 Itzhak_Perlman_0001 3 0.6250000000 1.00
91 Jackie_Chan_0001 3 0.6250000000 1.75
92 Jacques_Chirac_0018 3 0.6315789474 1.00
93 Jacques_Chirac_0049 3 0.8888888889 0.00
94 Jaime_Orti_0001 1 0.4074074074 2.00
95 James_Murdoch_0001 3 0.8518518519 0.00
96 Jamie_Cooke_0001 3 0.6666666667 1.00

97 Jason_Kidd_0006 1 0.4545454545 2.00
98 Jason_Kidd_0009 3 0.4814814815 1.00
99 Jayson_Williams_0001 3 0.6818181818 1.00
100 Jean_Charest_0001 1 0.4210526316 2.00
101 Jean_Chretien_0015 1 0.5454545455 1.75
102 Jean_Chretien_0052 3 0.6296296296 2.00
103 Jennifer_Lopez_0011 3 0.8095238095 0.00
104 Jennifer_Lopez_0012 1 0.5000000000 1.75
105 Jennifer_Reilly_0002 3 0.8095238095 0.00
106 Jeremy_Greenstock_0001 3 1.0000000000 0.00
107 Jeremy_Greenstock_0020 3 0.5000000000 2.00
108 Jeremy_Greenstock_0023 3 0.8333333333 0.00
109 Jerry_Colangelo_0001 3 0.4090909091 2.00
110 Jerry_Regier_0003 3 0.4285714286 2.00
111 Jim_Tressel_0002 3 0.7500000000 0.00
112 Jim_Wong_0001 3 0.7272727273 0.75
113 Jimmy_Gurule_0001 3 0.9166666667 0.00
114 Joe_Gatti_0002 1 0.6666666667 1.00
115 John_Ashcroft_0033 3 0.6363636364 1.75
116 John_Bolton_0016 3 0.6363636364 1.75
117 John_Bond_0001 3 0.7500000000 0.75
118 John_Howard_0008 3 0.8571428571 0.00
119 John_Howard_0009 3 0.8571428571 0.00
120 John_Jones_0001 3 0.8095238095 0.00
121 John_McCallum_0001 3 0.6296296296 2.00
122 John_Paul_DeJoria_0001 3 0.8095238095 0.00
123 John_Snow_0014 3 0.7500000000 0.75
124 John_Wolf_0001 3 0.7777777778 0.00
125 Jorge_Marquez-Ruarte_0001 3 0.5000000000 1.00
126 Joseph_Galante_0001 3 0.7727272727 0.00
127 Juan_Pablo_Montoya_0001 3 0.5714285714 1.00
128 Junichiro_Koizumi_0041 3 0.4736842105 1.50
129 Katherine_Harris_0001 1 0.4545454545 1.75
130 Keizo_Yamada_0001 3 0.6666666667 1.00
131 Kirk_Douglas_0001 3 0.7142857143 1.00
132 Kurt_Busch_0002 3 0.8333333333 0.00
133 Lana_Clarkson_0001 3 0.7368421053 0.50
134 Leandrinho_Barbosa_0001 3 0.4074074074 2.00
135 Li_Changchun_0001 1 0.7894736842 0.00
136 Lori_Berenson_0001 3 0.9166666667 0.00
137 Luis_Pujols_0001 3 0.4736842105 2.00
138 Luiz_Inacio_Lula_da_Silva_0016 3 0.8571428571 0.00
139 Luke_Walton_0001 1 0.6842105263 1.00
140 Mahmoud_Abbas_0027 3 0.5714285714 1.00
141 Makhdoom_Amin_Fahim_0002 3 0.6666666667 1.00
142 Maria_Luisa_Mendonca_0001 1 0.3636363636 1.75
143 Maria_Soledad_Alvear_Valenzuela_0002 2 0.3636363636 2.00
144 Mariah_Carey_0007 1 0.3636363636 2.00
145 Mario_Alfaro-Lopez_0001 3 0.6842105263 1.00
146 Mario_Lemieux_0001 3 0.4090909091 2.00
147 Mark_Foley_0001 1 0.4814814815 2.00
148 Marricia_Tate_0001 1 0.5416666667 1.00

149 Martin_Howard_0001 3 0.9090909091 0.00
150 Martin_Verkerk_0003 1 0.7142857143 1.00
151 Mary_Tyler_Moore_0001 3 0.9090909091 0.00
152 Michael_Ballack_0001 3 0.6818181818 1.00
153 Michael_Bloomberg_0005 3 0.4090909091 1.75
154 Michael_Dell_0001 3 0.6818181818 2.00
155 Michael_Jackson_0009 3 0.7142857143 1.00
156 Michael_Olowokandi_0001 1 0.6190476190 1.00
157 Michael_Patrick_King_0002 3 0.6363636364 1.00
158 Michael_Schumacher_0011 3 0.7894736842 0.00
159 Mike_Stefanik_0001 3 0.9473684211 0.00
160 Mike_Weir_0007 1 0.4285714286 2.00
161 Morgan_Fairchild_0001 1 0.5238095238 2.00
162 Muhammad_Saeed_al-Sahhaf_0004 3 0.8571428571 0.00
163 Naji_Sabri_0002 3 0.4090909091 2.00
164 Nan_Wang_0004 3 0.6250000000 2.00
165 Nancy_Pelosi_0001 3 0.5555555556 1.00
166 Nicolas_Macrozonaris_0001 3 0.4090909091 2.00
167 Nikki_McKibbin_0001 1 0.4166666667 2.00
168 Oliver_Neuville_0001 3 0.5833333333 1.75
169 Orlando_Bloom_0002 3 0.6250000000 2.00
170 Paradorn_Srichaphan_0007 1 0.5263157895 2.00
171 Patrick_Eaves_0001 3 0.5238095238 2.00
172 Paul_Bremer_0013 3 0.7619047619 0.00
173 Paul_Martin_0003 3 0.7727272727 0.00
174 Paul_McCartney_0004 3 0.7142857143 1.00
175 Paul_Sarbanes_0001 3 0.8947368421 0.00
176 Pedro_Malan_0004 1 0.7272727273 0.75
177 Pete_Sampras_0011 3 0.5000000000 2.00
178 Phil_Cline_0001 3 0.4761904762 2.00
179 Phil_Cullen_0001 3 0.8421052632 0.00
180 Ralph_Goodale_0001 3 0.7407407407 0.50
181 Recep_Tayyip_Erdogan_0015 3 0.7727272727 0.00
182 Renee_Zellweger_0004 3 0.4736842105 1.50
183 Ricardo_Lagos_0024 3 0.8888888889 0.00
184 Richard_Cohen_0001 3 0.4761904762 1.00
185 Robert_Lange_0001 3 0.7272727273 0.75
186 Roberto_Lavagna_0001 3 0.4736842105 2.00
187 Roberto_Marinho_0001 3 0.8333333333 0.00
188 Rogelio_Ramos_0001 3 0.7727272727 0.00
189 Roh_Moo-hyun_0024 3 0.9090909091 0.00
190 Roman_Abramovich_0001 3 0.7272727273 0.75
191 Ruben_Studdard_0001 3 0.4761904762 2.00
192 Saeb_Erekat_0002 3 0.7916666667 0.00
193 Sebastian_Saja_0002 3 0.4444444444 1.00
194 Serena_Williams_0023 1 0.5000000000 1.00
195 Serena_Williams_0038 1 0.4090909091 1.75
196 Shanna_Zolman_0001 3 0.6250000000 1.00
197 Sharon_Osbourne_0003 3 0.5263157895 2.00
198 Shaul_Mofaz_0001 1 0.7037037037 1.00
199 Sherri_Coale_0001 3 0.7368421053 0.50
200 Spike_Jonze_0001 1 0.4736842105 2.00

201 Stanislas_Wawrinka_0001 3 0.6296296296 1.00
202 Steve_Backley_0001 3 0.8095238095 0.00
203 Steve_Cutler_0001 3 0.7083333333 1.00
204 Steve_Lavin_0005 1 0.5000000000 2.00
205 Steven_Hatfill_0002 3 0.7142857143 1.00
206 Thaksin_Shinawatra_0002 3 0.7272727273 0.75
207 Thomas_OBrien_0005 3 0.5714285714 2.00
208 Thomas_OBrien_0009 3 0.5454545455 2.00
209 Tian_Liang_0001 3 0.7037037037 1.00
210 Tim_Allen_0004 1 0.6190476190 1.00
211 Tom_Cruise_0007 1 0.5714285714 2.00
212 Tom_Daschle_0015 3 0.5000000000 2.00
213 Tony_Blair_0040 3 0.6666666667 1.75
214 Tony_Blair_0073 3 0.4545454545 2.00
215 Tony_Blair_0080 3 0.8181818182 0.00
216 Tony_Blair_0106 3 0.3809523810 2.00
217 Tony_Blair_0114 3 1.0000000000 0.00
218 Tony_Elias_0001 3 0.7619047619 0.00
219 Trent_Lott_0010 3 0.7142857143 1.00
220 Venus_Williams_0008 3 0.7500000000 0.75
221 Vince_Gill_0002 1 0.8181818182 0.00
222 Vladimir_Putin_0020 3 0.6666666667 1.00
223 Wayne_Gretzky_0001 1 0.6842105263 2.00
224 William_Ford_Jr_0001 3 0.7894736842 0.00
225 William_Shatner_0001 1 0.6818181818 1.75
226 Yasser_Arafat_0007 1 0.619047619 1.00
227 Yu_Shyi-kun_0001 3 0.7368421053 0.50
228 Yuri_Malenchenko_0002 3 0.9629629630 0.00

¹ Interquartile range
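A minimal sketch of how the Mode, Mode frequency and IQR columns above could be recomputed from the answer codes gathered for one image (1 = not paying attention, 2 = uncertain, 3 = paying attention). Tie-breaking between equally frequent answers and the exact quartile interpolation are assumptions and may differ from the procedure used to produce the table.

    from collections import Counter
    import numpy as np

    def attention_labels(codes):
        # `codes` is the list of answer codes given for one image.
        counts = Counter(codes)
        mode, mode_count = counts.most_common(1)[0]
        mode_frequency = mode_count / len(codes)
        iqr = np.percentile(codes, 75) - np.percentile(codes, 25)
        return mode, mode_frequency, iqr

    # Example: 21 answers, mostly 'paying attention'.
    print(attention_labels([3] * 15 + [1] * 4 + [2] * 2))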

D CrowdFlower task example
