<<

arXiv:1905.01347v2 [cs.LG] 4 Jun 2019 c fIaee.Cd n noain r vial at: available are annotations demograph- and the Code in study ImageNet. imbalances devel- the of of future for ics effects and for downstream models point We annotation of starting unbiased the bias. of as to opment subject work are this the annotation present inter- that so all note for groups, We fair account sectional not 29 is 27.11%. to framework 15 with model-driven aged presented subgroup males largest individu- and the 60, as faces of for appear of age 1.71% the 41.62% female, above that als as find appear We ILSVRC hierarchi- in ImageNet. ‘person’ of the category and cal ImageNet of Challenge subset the Recognition Visual (ILSVRC) of this Scale audit Large Using demographic ImageNet first 2012 the datasets. conduct image we large-scale framework, frame- in model-driven gen- and attributes a age apparent der introduce vision of annotation we automatic the of work, for work variety this wide In a tasks. for models within pretrain us frequently biases to is inherent it on given a important insights Such particularly new ImageNet, to dataset. lead the could within study contained demographic images the of not into attributes has investigation there comprehensive impact, a vision significant been computer its for Despite learning deep applications. in interest industry and cl iulRcgiinCalne(LVC [ (ILSVRC) Challenge Recognition Visual Scale aaaeo nls [ lexical English WordNet of the follows ImageNet vision. computer in Introduction 1. http://bit.ly/ImageNetDemoAudit o ehnclTr AT [ (AMT) Turk Mechanical Ama- a zon with contains through annotated and ImageNet collected search synsets, web-based comprehensive 21,841 concept. in images distinct 14,197,122 a expressing each h mgNtdtstuhrdi odo academic of flood a in ushered dataset ImageNet The mgNt[ ImageNet uiigIaee:TwrsaMdldie rmwr o An for Framework Model-driven a Towards ImageNet: Auditing 7 ,rlae n20,i aoia dataset canonical a is 2009, in released ], eorpi trbtso ag-cl mg Datasets Image Large-Scale of Attributes Demographic 15 [email protected] ,wihgop od into words groups which ], Abstract aelo nai,Canada Ontario, Waterloo, nvriyo Waterloo of University hi Dulhanty Chris 7 .TeIaee Large ImageNet The ]. 19 synsets ,held ], ed , 1 ere rmdt htaeifleta ntedcso fa of [ society decision of the values patterns with in of aligned not influential form are are but the model, that in data manifest from can learned biases These agate. enn.Wtotacncosefr oicroaediversi collection, incorporate data to in effort conscious a Without cerning. contains. it classes the ImageNet, composi- in the of images into of release study tion comprehensive the a following been not years prac- has ten there common the scrutinizing In and back tices. step a ther taking While applications, in practical merit in ImageNet. successful of proved has details this the away abstracted fectively classifi- Krizhevsky ILSVRC by work the Seminal in A used task. were learning. cation synsets deep 1,000 in explo- of interest an industry subset for and catalyst academic the of was sion 2017, to 2010 from annually ersnaino lc esn nohrcassmyla to ‘ lead of may representation under- classes biased an other a that in hypothesized persons They black of class. representation the in race of ance ‘ synset ball the [ of Cisse examples and prototypical Stock that criticism, discovered model of adversarial form a Using its as biases. on examples undesirable pretrained encode CNNs also that may evidence data some is there [ geNet, classifiers gender vision computer [ diagnosis [ instrumen- radiograph segmentation is chest instance CNNs as varied pretrained as of applications data in use tal new The training task. for new the point on starting task, the a uses desired as CNN weights the the pretrained where fit performed, to is of learning adjusted transfer subset and is ILSVRC network 2012 the the on ImageNet, trained those weights to with downloaded initialized is CNN pretrained a process: dard vision. computer in model preeminent the as network (CNN) neural convolutional deep the cemented event 2012 edradrca isshv enepsdi odem- word in exposed been have [ beddings biases racial and gender hslc fsrtn noIaee’ otnsi con- is contents ImageNet’s into scrutiny of lack This ef- have practitioners vision computer convention, By oa,wr ncmue iinlreyflosastan- a follows largely vision computer in work Today, oti mgso lc esn,dsiearltv bal- relative a despite persons, black of images contain ’ aelo nai,Canada Ontario, Waterloo, nvriyo Waterloo of University [email protected] lxne Wong Alexander 3 ,iaecpinn oes[ models captioning image ], neial biases undesirable basketball 17 ]. 4 .I h aeo Ima- of case the In ]. ’. a olc n prop- and collect can 2 notating ,adcommercial and ], tal. et [ 12 20 nthe in ] basket- .Age, ]. 9 and ] is e 20 ty ] Gender Gender Average Precision (%) Accuracy (%) Male Female All Male Female All 0-14 93.75 100.00 97.44 0-14 79.19 75.78 77.62 15-29 99.20 100.00 99.61 15-29 96.09 89.80 91.95 Age Age 30-44 100.00 99.11 99.76 30-44 100.00 91.72 95.99 Group Group 45-59 100.00 100.00 100.00 45-59 100.00 91.30 96.89 60+ 99.17 100.00 99.24 60+ 84.91 79.71 82.86 All 99.48 99.74 99.54 All 94.15 88.04 91.00

Table 1. Face detection model average precision on a subset of Table 3. Gender model binary classification accuracy on APPA- FDDB, hand-annotated by the author for apparent age and gender REAL test set

Gender Gender Mean Average Error (years) Accuracy (%) Male Female All Male Female All 0-14 8.26 9.37 8.77 I 100.00 98.31 99.12 15-29 3.24 3.65 3.57 II 100.00 97.14 98.86 Age 30-44 4.52 4.93 4.72 Fitzpatrick III 100.00 95.08 97.67 Group 45-59 4.43 5.50 4.81 Skin IV 100.00 86.11 92.31 60+ 7.70 9.48 8.40 Type V 100.00 68.79 83.70 All 5.13 5.29 5.22 VI 100.00 62.77 86.22 All 100.00 83.57 92.68 Table 2. Apparent age model mean average error in years on APPA-REAL test set Table 4. Gender model binary classification accuracy on PPB

This paper is the first in a series of works to build a each synset and measuring the lossless JPG file size. They frameworkfor the audit of the demographicattributes of Im- state that a diverse synset will result in a blurrier average ageNet and other large image datasets. The main contribu- image and smaller file, representative of diversity in appear- tions of this work include the introductionof a model-driven ance, position, viewpoint and background. This method, demographic annotation pipeline for apparent age and gen- however, cannot quantify diversity with respect to demo- der, analysis of said annotation models and the presenta- graphic characteristics such as age, gender, and skin type. tion of annotations for each image in the training set of the ILSVRC 2012 subset of ImageNet (1.28M images) and the 3. Methodology ‘person’ hierarchical synset of ImageNet (1.18M images). In order to provide demographic annotations at scale, there exist two feasible methods: crowdsourcing and 2. Diversity Considerations in ImageNet model-driven annotations. In the case of large-scale image Before proceeding with annotation, there is merit in con- datasets, crowdsourcing quickly becomes prohibitively ex- textualizing this study with a look at the methodology pro- pensive; ImageNet, for example, employed 49k AMT work- posed by Deng et al. in the construction of ImageNet. A ers during its collection [13]. Model-driven annotations use close reading of their data collection and quality assurance methods to create models that can pre- processes demonstrates that the conscious inclusion of de- dict annotations, but this approachcomes with its own meta- mographic diversity in ImageNet was lacking [7]. problem; as the goal of this work is to identify demographic First, candidate images for each synset were sourced representation in data, we must analyze the annotation mod- from commercial image search engines, including Google, els for their performance on intersectional groups to deter- Yahoo!, Microsoft’s Live Search, Picsearch and Flickr [8]. mine if they themselves exhibit bias. Gender [11] and racial [16] biases have been demonstrated 3.1. Face Detection to exist in image search results (i.e. images of occupa- tions), demonstrating that a more curated approach at the The FaceBoxes network [22] is employed for face de- top of the funnel may be necessary to mitigate inherent tection, consisting of a lightweight CNN that incorporates biases of search engines. Second, English search queries novel Rapidly Digested and Multiple Scale Convolutional were translated into Chinese, Spanish, Dutch and Italian us- Layers for speed and accuracy, respectively. This model ing WordNet and used for image retrieval. While was trained on the WIDER FACE dataset [21] and achieves this is a step in the right direction, Chinese was the only average precision of 95.50% on the Face Detection Data Set non-Western European language used, and there exists, for and Benchmark (FDDB) [10]. On a subset of 1,000 images example, Universal Multilingual WordNet which includes from FDDB hand-annotated by the author for apparent age over 200 languages for translation [6]. Third, the authors and gender, the model achieves a relative fair performance quantify image diversity by the average image of across intersectional groups, as show in Table 1. Gender Gender % of Dataset % of Dataset Male Female All Male Female All 0-14 4.06 3.01 7.08 0-14 3.35 1.66 5.00 15-29 27.11 23.72 50.83 15-29 24.63 16.95 41.58 Age Age 30-44 17.81 9.02 26.83 30-44 21.02 7.36 28.38 Group Group 45-59 8.48 5.08 13.56 45-59 16.87 3.70 20.57 60+ 0.91 0.80 1.71 60+ 3.02 1.44 4.47 All 58.38 41.62 100.00 All 68.89 31.11 100.00

Table 5. Top-level of ILSVRC 2012 ImageNet subset Table 7. Top-level statistics of ImageNet ‘person’ subset

% of Synset Annotated Male % of Synset Annotated Female % of Synset Annotated Male % of Synset Annotated Female ‘barracouta’ 94.24 ‘brassiere’ 88.59 ‘spree’ 100.00 ‘bombshell’ 100.00 ‘ballplayer’ 93.47 ‘golden retriever’ 88.29 ‘pitchman’ 100.00 ‘choker’ 100.00 ‘Windsor tie’ 93.29 ‘bikini’ 85.91 ‘counterterrorist’ 100.00 ‘dyspeptic’ 100.00 ‘gar’ 92.76 ‘maillot’ 85.90 ‘second’ 100.00 ‘comedienne’ 96.30 ‘tench’ 92.40 ‘tank suit’ 85.47 ‘Kennan’ 100.00 ‘cover’ 96.24 ‘rugby ball’ 91.96 ‘miniskirt’ 85.14 ‘chandler’ 100.00 ‘tempter’ 95.92 ‘barbershop’ 90.06 ‘cocker spaniel’ 83.82 ‘agnostic’ 100.00 ‘nymph’ 95.74 ‘bulletproof vest’ 89.33 ‘gown’ 83.62 ‘helmsman’ 100.00 ‘artist”s 95.05 ‘sax’ 89.16 ‘lipstick’ 82.81 ‘anti-American’ 100.00 ‘nymphet’ 93.84 ‘swimming trunks’ 88.36 ‘Maltese dog’ 82.06 ‘Girondist’ 100.00 ‘smasher’ 93.84 ‘assault rifle’ 88.12 ‘stole’ 81.64 ‘argonaut’ 100.00 ‘maid’ 93.05 ‘sturgeon’ 87.61 ‘cardigan’ 81.32 ‘oilman’ 100.00 ‘model’ 93.01

Table 6. Gender-biased synsets, ILSVRC 2012 ImageNet subset Table 8. Top-level statistics of ImageNet ‘person’ subset

3.2. Apparent Age Annotation on the Pilot Parliaments Benchmark (PPB) [4], a face dataset developed by Buolamwini and Gebru for parity in The task of apparent age annotation arises as ground- gender and skin type. Results for intersectional groups on truth ages of individuals in images are not possible to ob- PPB are shown in Table 4. The model performs very poorly tain in the domain of web-scraped datasets. In this work, for darker-skinned females (Fitzpatrick skin types IV - VI), we follow Merler et al. [14] and employ the Deep EXpecta- with an average accuracy of 69.00%, reflecting the disparate tion (DEX) modelof apparentage [18], which is pre-trained findings of commercial gender classifiers on the IMDB-WIKI dataset of 500k faces with real ages and in Gender Shades [4]. We note that use of this model in an- fine-tuned on the APPA-REAL training and validation sets notating ImageNet will result in biased gender annotations, of 3.6k faces with apparent ages, crowdsourced from an av- but proceed in order to establish a baseline upon which the erage of 38 votes per image [1]. As show in Table 2, the results of a more fair gender annotation model can be com- model achieves a mean average error of 5.22 years on the pared in future work, via fine-tuning on crowdsourced gen- APPA-REAL test set, but exhibits worse performance on der annotations from the Diversity in Faces dataset [14]. younger and older age groups. 3.3. Gender Annotation 4. Results We recognize that a binary representation of gender does We evaluate the training set of the ILSVRC 2012 sub- not adequately capture the complexities of gender or repre- set of ImageNet (1000 synsets) and the ‘person’ hierarchi- sent transgender identities. In this work, we express gender cal synset of ImageNet (2833 synsets) with the proposed as a continuous value between 0 and 1. When thresholding methodology. Face detections that receive a confidence at 0.5, we use the sex labels of ‘male’and‘female’ to define score of 0.9 or higher move forward to the annotation phase. gender classes, as training datasets and evaluation bench- Statistics for both datasets are presented in Tables 5 and 7. marks use this binary label system. We again follow Merler In these preliminary annotations, we find that females com- et al. [14] and employ a DEX model to annotate the gen- prise only 41.62% of images in ILSVRC and 31.11% in the der of an individual. When tested on APPA-REAL, with ‘person’ subset of ImageNet, and people over the age of 60 enhanced annotations provided by [5], the model achieves are almost non-existent in ILSVRC, accounting for 1.71%. an accuracy of 91.00%, however its errors are not evenly To get a of the most biased classes in terms of gen- distributed, as shown in Table 3. The model errs more on der representation for each dataset, we filter synsets that younger and older age groups and on those with a female contain at least 20 images in their class and received face gender label. detections for at least 15% of their images. We then cal- Given these biased results, we further evaluate the model culate the percentage of males and females in each synset and rank them in descending order. Top synsets for each [7] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei- gender and dataset are presented in Tables 6 and 8. Top Fei. Imagenet: A large-scale hierarchical image database. ILSVRC synsets for males largely represent types of fish, In IEEE conference on computer vision and pattern recogni- sports and firearm-related items and top synsets for females tion, pages 248–255. IEEE, 2009. 1, 2 largely represent types of clothing and dogs. [8] L. Fei-Fei. Imagenet: crowdsourcing, benchmarking & other cool things. In CMU VASC Seminar, volume 16, pages 18– 5. Conclusion 25, 2010. 2 [9] K. He, G. Gkioxari, P. Doll´ar, and R. Girshick. Mask r-cnn. Through the introduction of a preliminary pipeline for In Proceedings of the IEEE international conference on com- automated demographic annotations, this work hopes to puter vision, pages 2961–2969, 2017. 1 provide insight into the ImageNet dataset, a tool that is [10] V. Jain and E. Learned-Miller. Fddb: A benchmark for face detection in unconstrained settings. Technical report, Uni- commonly abstracted away by the computer vision com- versity of Massachusetts, Amherst, 2010. 2 munity. In the future, we will continue this work to create [11] M. Kay, C. Matuszek, and S. A. Munson. Unequal represen- fair models for automated demographic annotations, with tation and gender stereotypes in image search results for oc- emphasis on the gender annotation model. We aim to in- cupations. In Proceedings of the 33rd Annual ACM Confer- corporate additional measures of diversity into the pipeline, ence on Human Factors in Computing Systems, pages 3819– such as Fitzpatrick skin type and other craniofacial mea- 3828. ACM, 2015. 2 surements. When annotation models are evaluated as fair, [12] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet we plan to continue this audit on all 14.2M images of Im- classification with deep convolutional neural networks. In ageNet and other large image datasets. With accurate cov- Advances in neural information processing systems, pages erage of the demographic attributes of ImageNet, we will 1097–1105, 2012. 1 be able to investigate the downstream impact of under- and [13] F. Li and J. Deng. Imagenet: Where are we going? and over-representedgroups in the features learned in pretrained where have we been. In Conference on Computer Vision and CNNs and how bias represented in these features may prop- , 2017. 2 agate in to new applications. [14] M. Merler, N. Ratha, R. S. Feris, and J. R. Smith. Diversity in faces. arXiv preprint arXiv:1901.10436, 2019. 3 [15] G. A. Miller. Wordnet: a lexical database for english. Com- References munications of the ACM, 38(11):39–41, 1995. 1 [1] E. Agustsson, R. Timofte, S. Escalera, X. Baro, I. Guyon, [16] S. U. Noble. of oppression: How search engines and R. Rothe. Apparent and real age estimation in still im- reinforce racism. nyu Press, 2018. 2 ages with deep residual regressors on appa-real database. In [17] P. Rajpurkar, J. Irvin, R. L. Ball, K. Zhu, B. Yang, H. Mehta, 12th IEEE International Conference on Automatic Face & T. Duan, D. Ding, A. Bagul, C. P. Langlotz, et al. Deep learn- , pages 87–94. IEEE, 2017. 3 ing for chest radiograph diagnosis: A retrospective com- [2] L. Anne Hendricks, K. Burns, K. Saenko, T. Darrell, and parison of the chexnext to practicing radiologists. A. Rohrbach. Women also snowboard: Overcoming bias in PLoS medicine, 15(11):e1002686, 2018. 1 captioning models. In Proceedings of the European Confer- [18] R. Rothe, R. Timofte, and L. Van Gool. Deep expectation ence on Computer Vision, pages 771–787, 2018. 1 of real and apparent age from a single image without fa- cial landmarks. International Journal of Computer Vision, [3] T. Bolukbasi, K.-W. Chang, J. Y. Zou, V. Saligrama, and 126(2-4):144–157, 2018. 3 A. T. Kalai. Man is to computer programmer as woman is to homemaker? debiasing word embeddings. In Advances [19] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, in neural information processing systems, pages 4349–4357, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, 2016. 1 et al. Imagenet large scale visual recognition challenge. International journal of computer vision, 115(3):211–252, [4] J. Buolamwini and T. Gebru. Gender shades: Intersectional 2015. 1 accuracy disparities in commercial gender classification. In [20] P. Stock and M. Cisse. Convnets and beyond ac- Conference on Fairness, Accountability and Transparency, curacy: Understanding mistakes and uncovering biases. In pages 77–91, 2018. 1, 3 Proceedings of the European Conference on Computer Vi- [5] A. Clap´es, O. Bilici, D. Temirova, E. Avots, G. Anbarja- sion, pages 498–512, 2018. 1 fari, and S. Escalera. From apparent to real age: gender, [21] S. Yang, P. Luo, C.-C. Loy, and X. Tang. Wider face: A age, ethnic, makeup, and expression bias analysis in real face detection benchmark. In Proceedings of the IEEE con- age estimation. In Proceedings of the IEEE Conference on ference on computer vision and pattern recognition, pages Computer Vision and Pattern Recognition Workshops, pages 5525–5533, 2016. 2 2373–2382, 2018. 3 [22] S. Zhang, X. Zhu, Z. Lei, H. Shi, X. Wang, and S. Z. Li. [6] G. De Melo and G. Weikum. Towards a universal wordnet Faceboxes: A cpu real-time face detector with high accuracy. by learning from combined evidence. In Proceedings of the In IEEE International Joint Conference on Biometrics, pages 18th ACM conference on Information and knowledge man- 1–9. IEEE, 2017. 2 agement, pages 513–522. ACM, 2009. 2