Humans Categorise Humans: On ImageNet Roulette and Machine Vision

Olga Goriunova

Published in Donaufestival: Redefining Arts Catalogue, April 2020.

In September 2019, artist Trevor Paglen and researcher Kate Crawford were getting through the final week of their ImageNet Roulette being publicly available on the web. ImageNet Roulette is part of their exhibition "Training Humans," which ran between September 2019 and February 2020 at the Fondazione Prada. With this exhibition, collaborators Crawford and Paglen queried the collections of images, and the processes applied to them, that are used to train a wide range of machine learning algorithms to recognise and label images (a field known as "machine vision").1 Crawford and Paglen want to draw attention to how computers see and categorize the world: seeing and recognizing is not neutral for humans (one need only think about "seeing" race or gender), and the same goes for machines.

Machine vision, by now a large and successful part of AI, has taken off only in the last decade. For "machines" to recognize images, they need to be "trained" to do so. The first major step is a very large, freely available, pre-labelled image dataset. The first and largest training dataset that Crawford and Paglen focused on, and from which they derived the title of their project, is ImageNet, which is only ten years old.

ImageNet Roulette is a website and app that allows one to take a selfie and run it through ImageNet. It uses an "open-source Caffe deep-learning framework … trained on the images and labels in the 'person' categories."2 One is "seen" and "recognized," or labelled. I took my picture sitting with my laptop in my bed, in pyjamas and with bad lighting, and was labelled "person, individual, someone, somebody, mortal, soul => unwelcome person, persona non grata => disagreeable person => creep, weirdo, weirdie, weirdy, spook." So far, so good: I don't mind being labelled weird; perhaps it's part of my intellectual appeal.
I ran it again in the daytime, with the same result. It quickly became clear, though, that the labelling carried substantially more serious consequences for people of color. Guardian journalist Julia Carrie Wong was labelled "gook, slant-eye." She wrote: "below the photo, my label was helpfully defined as 'a disparaging term for an Asian person (especially for North Vietnamese soldiers in the Vietnam War).'"3 My younger female southern European PhD student told me she was labelled a "virgin," and when she ran the app again, a "mulatto," but her Black British male friend was labelled simply a "rapist."

This verdict of ImageNet Roulette is indicative of the bias and in-built racism whose evidence has accompanied the development of machine vision and image recognition technologies. In 2009, an HP face-tracking webcam could not follow a Black person.4 In 2010, Microsoft's Kinect motion-sensing camera did not work so well with dark-skinned users.5 In 2015, Google Photos classified Black men as gorillas (a "feature" Google fixed by removing "gorilla" from its set of image labels).6 Joy Buolamwini, a Black researcher, had to wear a white mask to test her own software.7 Face recognition, action recognition, and image categorization are different technical procedures whose results converge when it comes to privileging white skin and Caucasian features.

1 For the research paper, see Kate Crawford and Trevor Paglen, "Excavating AI: The Politics of Images in Machine Learning Training Sets," 19 September 2019.
2 Ibid.
3 Julia Carrie Wong, "The viral selfie app ImageNet Roulette seemed fun—until it called me a racist slur," The Guardian, 18 September 2019, https://www.theguardian.com/technology/2019/sep/17/imagenet-roulette-asian-racist-slur-selfie
It is not dissimilar to discrimination in other fields: voice-command car systems often do not respond to female drivers, but are quick to obey male voices, even when they come from the passenger seat.8

How do machines pick up racism and sexism? How do they learn to be nasty? By now, there are some standard responses to these questions. Machine learning is a branch of computer science that develops algorithms which improve their performance with experience. Neural networks are one example of machine learning algorithms; they are said to be modeled on the human neural network and to be capable of handling complexity. They need to be trained, either by a human or in an automated way, on a dataset; after training, they are able to start making independent decisions. For instance, one runs a neural network through a dataset of many images of haircuts, teaching the network to recognize different types of haircut. This is done by saying repeatedly, for example, "this is a bob" and "this is a shaved head." The neural network can then start to recognize haircuts on its own.

What matters here are the model and the dataset. The model might weight certain qualities of hair and kinds of haircut higher than others, making it unable, for instance, to differentiate between certain types of haircut or, say, curly hairstyles. Similarly, the training dataset may not be diverse enough, or it may be annotated in a way that biases it towards certain kinds of hair.
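The haircut example can be made concrete with a minimal, purely illustrative sketch (this is not the Caffe framework mentioned earlier, and all feature numbers and labels are invented): a toy nearest-centroid classifier. Whatever it is shown, it can only ever answer with a label that appeared in its training data.

```python
# Toy "training": average the feature vectors seen for each label.
# Features and labels below are invented for illustration only.

def train(examples):
    """Build a centroid (mean feature vector) per label -- the 'model'."""
    sums, counts = {}, {}
    for features, label in examples:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc]
            for label, acc in sums.items()}

def predict(model, features):
    """Label a new input with its nearest centroid."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(model, key=lambda label: sq_dist(model[label], features))

# A training set with no "curly" examples at all: the model has no such
# label, so curly hair is forced into one of the categories it does know.
training_set = [
    ([0.1, 0.9], "bob"),          # invented features: [length, straightness]
    ([0.2, 0.8], "bob"),
    ([0.0, 0.5], "shaved head"),
]
model = train(training_set)
print(predict(model, [0.9, 0.1]))  # long, curly hair still gets a known label
```

The point of the sketch is structural, not numerical: whatever the input, the output vocabulary is fixed by the training set, which is exactly how an absent or negatively labelled category sustains bias downstream.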
If curly hair is not included in the dataset, or is included with negative labels, the new outputs will sustain the "bias." Beyond models and datasets, computer vision's infrastructures (server architectures, the labor of computer scientists and data workers, knowledge infrastructures that spread beyond any one discipline and format, and many other elements and processes) can all carry "bias."

As ImageNet Roulette made headlines around the world, The Photographers' Gallery in London organized a symposium on computational images (as well as a birthday party for ImageNet). It was put together in particular by Katrina Sluis, curator of Digital Programmes and a long-standing scholar of the transformations that algorithms and AI bring to the field of photography and visual culture, and by researcher Nicolas Malevé; they secured the participation of Fei-Fei Li, Professor at Stanford University and one of the creators of ImageNet. Li's lecture, celebratory and focused on the history of and effort behind ImageNet, was illuminating in its framing, its orientation points, and its identification of inspiration.

4 "HP looking into claim webcams can't see black people," CNN, 24 December 2009, http://edition.cnn.com/2009/TECH/12/22/hp.webcams/index.html
5 "Is Microsoft's Kinect racist?" PCWorld, 4 November 2010, https://www.pcworld.com/article/209708/Is_Microsoft_Kinect_Racist.html
6 Tom Simonite, "When it comes to gorillas, Google Photos remains blind," Wired, 11 January 2018, https://www.wired.com/story/when-it-comes-to-gorillas-google-photos-remains-blind/
7 Joy Buolamwini, "InCoding—in the beginning," Medium, 16 May 2016, https://medium.com/mit-media-lab/incoding-in-the-beginning-4e2a5c51a45d
8 Sharon Silke Carty, "Many Cars Tone Deaf To Women's Voices," Autoblog.com, 31 May 2011, https://www.autoblog.com/2011/05/31/women-voice-command-systems

ImageNet is an image dataset that helped make a breakthrough in computer vision.
Without a large dataset of images, no major work on the automation of vision would be possible. Fei-Fei Li started working on ImageNet in 2009, as she recounted, against the advice of senior colleagues. ImageNet's images come from Flickr, and were automatically harvested by the millions. No computer vision breakthrough would have been possible without social media and mass uploads of user-generated images, free to use.

At first, Princeton undergraduate students were paid to categorize and label the images. It was expensive and slow work. Around the same time, Amazon's "Mechanical Turk" was launched. This marketplace, now known for outsourcing the tedious and painstaking labor undergirding digital culture to countries where people have to accept as little as 0.02 USD per task, was also key to the success of machine vision. Some 50,000 workers in over a hundred countries undertook the labor of sifting through 160 million candidate Flickr-derived images and annotating 14 million of them, using the WordNet semantic structure.9

WordNet is a lexical database developed from 1985 onwards and used for automated text analysis (machine translation, information retrieval, and other tasks). The database works as a "conceptual dictionary," grouping words into sets of synonyms (synsets) and giving short definitions and examples; it further organizes them into hierarchies, going from the specific to the more abstract.
WordNet is notorious for its bias: query it for "woman" and you get the following.10

• S: (n) woman, adult female (an adult female person (as opposed to a man)) "the woman kept house while the man hunted"
• S: (n) woman (a female person who plays a significant role (wife or mistress or girlfriend) in the life of a particular man) "he was faithful to his woman"
• S: (n) charwoman, char, cleaning woman, cleaning lady, woman (a human female employed to do housework) "the char will clean the carpet"; "I have a woman who comes in four hours a day while I write"
• S: (n) womanhood, woman, fair sex (women as a class) "it's an insult to American womanhood"; "woman is the glory of creation"; "the fair sex gathered on the veranda"

Needless to say, the entry for "man" is three times longer and does not define the man exclusively in relation to sexual and household services to the "opposite sex." Once WordNet's semantic structure was taken on as the system for labelling images, it was up to the workers sourced through Mechanical Turk to decide which categories to "image." As Fei-Fei Li explained, the process was entirely automated.
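WordNet's synset-and-hierarchy structure, and the way ImageNet Roulette's chained labels ("person … => … => creep, weirdo …") follow it, can be sketched as a small hand-made graph. This is not the real WordNet database, only an abridged toy modeled on the entries quoted above; each synset groups synonymous words and points to one more abstract hypernym.

```python
# Abridged, hand-made imitation of WordNet's structure: each synset has
# member words and a hypernym link going from the specific to the abstract.
synsets = {
    "charwoman": {"words": ["charwoman", "char", "cleaning woman", "cleaning lady"],
                  "hypernym": "woman"},
    "woman":     {"words": ["woman", "adult female"], "hypernym": "adult"},
    "adult":     {"words": ["adult", "grownup"], "hypernym": "person"},
    "person":    {"words": ["person", "individual", "someone", "somebody",
                            "mortal", "soul"],
                  "hypernym": None},   # top of this toy hierarchy
}

def hypernym_chain(name):
    """Walk the hypernym links from a specific synset up to the most abstract."""
    chain = []
    while name is not None:
        chain.append(name)
        name = synsets[name]["hypernym"]
    return chain

print(" => ".join(hypernym_chain("charwoman")))
# prints: charwoman => woman => adult => person
```

An image annotated with any specific synset thereby inherits the whole chain above it, which is why a selfie comes back with a cascade of labels rather than a single word.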