Exploring Data Pipelines through the Process Lens: a Reference Model for Computer Vision

Agathe Balayn (TU Delft, [email protected])   Bogdan Kulynych (EPFL, [email protected])   Seda Gürses (TU Delft, [email protected])

arXiv:2107.01824v1 [cs.CV] 5 Jul 2021

Abstract

Researchers have identified datasets used for training computer vision (CV) models as an important source of hazardous outcomes, and continue to examine popular CV datasets to expose their harms. These works tend to treat datasets as objects, or focus on particular steps in data production pipelines. We argue here that we could further systematize our analysis of harms by examining CV data pipelines through a process-oriented lens that captures the creation, the evolution, and the use of these datasets. As a step towards cultivating a process-oriented lens, we embarked on an empirical study of CV data pipelines informed by the field of method engineering. We present here a preliminary result: a reference model of CV data pipelines. Besides exploring the questions that this endeavor raises, we discuss how the process lens could support researchers in discovering understudied issues, and could help practitioners in making their processes more transparent.

1. Introduction

Training data can have a considerable impact on the outputs of a machine learning system, by affecting its accuracy, or by causing undesirable or even harmful effects when the system is deployed in applications that act in the world. In response to these harms, machine-learning and interdisciplinary researchers who tackle the societal impact of computer vision typically a) explore characteristics of datasets with respect to their potentially harmful impacts on society [8, 28, 22], e.g., taxonomies with offensive labels, often resulting from uncritical practices [2]; b) identify and mitigate harms by adapting single, discrete processes in the data pipeline, e.g. filtering out the offensive and non-visual labels of the taxonomy [33, 2, 17]; c) propose techniques to make datasets more transparent [12], to encourage analysts to be more accountable and reflective of potential harms. These forms of enquiry presuppose considering datasets as objects with configurable and conceivable properties.

Recent exceptions [23, 26, 7, 10, 16] hint at another direction, one that explores not only the datasets themselves but considers more broadly the data pipelines and the actors behind their creation and use. In this paper, we take inspiration from these works and explore a process-oriented lens on data pipelines. Such a lens may encourage researchers and practitioners to be more rigorous in the production and analysis of datasets vis-à-vis their potential societal impact. Most importantly, a process lens enables us to recognize and capture the following:

Datasets as complex, living objects. Datasets are not inert but living objects. They are regularly updated in production to improve models, and they are often re-purposed and adapted for new applications. Moreover, datasets interact with each other, particularly through models: e.g., they are composed using transfer learning [19], such as for improving facial recognition systems meant to be applied in a wide variety of contexts [25]. However, while datasets can create harms at each phase of their life, academic works on datasets are typically ambiguous about these phases, and seem to be based only on the datasets' first release.

The pipeline as a vector of harms. Not only can the processes that follow a dataset's first release raise issues, but so can the interplay between these processes and the ones leading to the first release, as it defines the potential errors of a model. For instance, in Detroit, following harmful mistakes of a facial recognition system in deployment, the police decided to apply it only to still images, as these are closer to the training data collected in a static setting during development; this shows how some harms can occur due to problematic mismatches between processes in the deployment and development phases [1]. Some existing works have already investigated the impact of pipeline processes such as rotation, cropping, and other pre-processing on models' accuracy [11, 15, 34], although not necessarily on societal harms [8]. Nonetheless, little attention has been given to studying the impact of sequences of processes.

Reasoning behind data pipeline design. The lack of transparency about the design of a dataset's pipeline (see examples in Appendix A) makes it challenging for researchers to analyse the procedural harms, or the downstream harms of the models trained on a dataset.

Such transparency would reveal not only the processes themselves, but also the reasoning behind them. For instance, while it is valuable to study datasets in depth, it can be intractable to analyse each label and image due to the scale. Documentation of the reasoning behind the choices of label ontologies, and of the process of collecting the corresponding images, would improve the analysis. E.g., knowing that WordNet was used to select the labels of ImageNet [9] enables us to directly reflect on harms from the point of view of the taxonomy [8], and knowing that images were collected from country-specific search engines enables us to foresee cultural skews in label representations [28]. Such matters are not consistently included in academic works, and they are not always explicitly asked for in documentation methodologies [12].

Our contributions. In this paper, we devise a systematic approach to study data pipelines in computer vision. For this, we combine techniques from the development of reference models, method engineering, and process engineering. These are well-established, intertwined fields [31] that propose methods to capture complex sets of processes. Using these techniques, we develop an empirically informed reference model of data pipelines in computer vision. We then reflect on its potential usefulness, as well as on the possible limitations of this approach for creating transparency and accountability for practitioners and researchers.

2. Methodology

Our objective is to build a reference model of the data pipeline in machine learning-based computer vision tasks. The literature on method, process, and reference-model engineering identifies the following requirements for reference models: representativity of the pipelines used in practice, at an abstraction level high enough to comprehensively include each of their processes [21, 13, 4]; modularity, to easily add new processes [14, 4]; clarity, for individuals interested in different parts of the pipeline [21, 4]; and a sufficient level of detail to enable actionable reporting and reflection on potential ethical issues.

Following these requirements, we build: a high-level reference model that serves as an ontology of our main concepts; a set of low-level models of each identifiable process within existing pipelines; and a set of process maps reflecting the pipelines of the main computer vision practices.

Populating the models and maps. The methods from each of the mentioned fields have advantages and limitations. Process patterns in process engineering allow for a fine-grained description of the processes, but generally impose an order on these processes, which does not correspond to the computer vision pipelines we analysed. Reference models do not necessarily impose an order, but can be too generic for our desired level of detail. However, method engineering formalisms can alleviate these issues, as they support processes of different granularities, and the interactions between the processes can be represented in a map without necessarily enforcing an order.

Thus, we employed a mix of methods from the different disciplines. The details can be found in Appendix B. In short: 1) We did a systematic literature review of publications that present computer vision datasets (51 publications) or specific processes to build datasets (16 publications), selected out of 220 relevant papers found using different search methods. 2) We identified and noted the main processes within each publication. 3) By comparing the descriptions, we iteratively identified common processes, refined their granularity, and aggregated them into meaningful method processes. 4) We described each of them with a relevant formalism, and drew the maps of each dataset pipeline. 5) We searched for grey areas, and are in the process of conducting expert interviews to further validate and refine our reference model, and to model processes that are not objects of scientific publications (e.g. processes for serving data in deployment).

3. Results: Reference model of data processes

In this section, we present our formalism for the components of the reference model, summarized in Figure 1, and provide examples of its application.

3.1. Products

The products are the essential components of the datasets. Each product is defined by an object (e.g. an image, a set of images, an image-annotation pair, etc.) and a set of properties. Based on the literature review, we have identified the following types of products:
Label-related products. A dataset comprises a set of target labels, often organized into a hierarchical taxonomy, with higher levels of the taxonomy representing more abstract concepts (e.g. animal) and lower levels representing more fine-grained concepts (e.g. types of animals). The main properties considered in the reviewed literature are the size (number of labels) and the number of levels in the hierarchy. Many properties of labels are currently implicit, whereas other properties, like where the taxonomy originates from, can be relevant for capturing potential harms.

Image-related products. A dataset typically contains a set of images. The size of the set, the content of the images, and especially its diversity are put forward as the main properties in the reviewed literature. Additional properties can be considered, especially the metadata attached to the images (e.g. the author of the images, etc.), physical characteristics of the images such as their quality, resolution, etc., and the distribution of these characteristics across the set.

Annotation-related products. These products can be thought of as a list of tuples, each composed of an image and one or multiple annotations. The properties put forward are often the number and quality of annotations per image.
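To make the product formalism concrete, the following is a minimal Python sketch of how the three product types could be encoded. This is our illustration rather than an artifact of the reference model itself; all class and field names are assumptions chosen for readability.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional, Tuple


@dataclass
class Product:
    """A dataset product: an object together with a set of properties."""
    name: str
    properties: Dict[str, Any] = field(default_factory=dict)


@dataclass
class LabelProduct(Product):
    """Label-related product: target labels, possibly a taxonomy."""
    labels: List[str] = field(default_factory=list)
    hierarchy_levels: int = 1              # depth of the taxonomy
    taxonomy_source: Optional[str] = None  # often left implicit in papers


@dataclass
class ImageProduct(Product):
    """Image-related product: a set of images plus their metadata."""
    image_uris: List[str] = field(default_factory=list)


@dataclass
class AnnotationProduct(Product):
    """Annotation-related product: (image, annotations) tuples."""
    pairs: List[Tuple[str, List[str]]] = field(default_factory=list)


# A toy label product that makes the usually-implicit properties
# explicit, e.g. where the taxonomy originates from.
toy_labels = LabelProduct(
    name="toy-taxonomy",
    properties={"size": 3},
    labels=["animal", "dog", "cat"],
    hierarchy_levels=2,
    taxonomy_source="WordNet",
)
```

Making fields such as `taxonomy_source` mandatory parts of a product record is one way to force the normally implicit properties into view.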

Figure 1. UML diagram describing the main components of our reference model formalism, and activity diagram for the process map. Relationships: association, aggregation (i.e. "part-of"), composition, inheritance (i.e. "is-a").

3.2. Chunks

Borrowing a standard term from method engineering, we refer to the processes of the data pipeline as chunks. Chunks can be composed: a chunk can consist of multiple lower-level chunks. We define three major granularity levels. 1) The pipeline level—the main sections of the data pipeline: the development of training data, the processes of improving and updating the training data in deployment (feedback loops), and the processing of data at inference time. 2) The process level—the individual activities taking place in the chunks of the pipeline level (e.g., data collection). 3) The task level—the sub-activities required to conduct the activities of the prior level (e.g. query preparation and image crawling for data collection). These chunks can be further divided into sub-tasks when necessary.

Each chunk has an associated guideline that defines the activity it represents in relation to the products, and a descriptor that uniquely identifies it. A guideline is composed of: i) an intention description, i.e. the goal of the chunk; ii) a situation description, i.e. the input and output products of the chunk; iii) a strategy description that describes the activity textually (i.e. at the finest granularity), or indicates the composition of a sequence of lower-level chunks performing the activity; iv) a set of strategy hyperparameters allowing one to be precise about the design choices in the strategy.

Based on the literature review, we have identified the following process-level chunks: label definition, data collection, data annotation, data filtering, data processing, data augmentation, data splitting, and product refinement. We provide a short description of these, and of some task-level chunks that we have identified, in Appendix C.
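As a reading aid, here is one possible encoding of chunks and their guidelines, continuing the illustrative sketch above. Every identifier in it (the chunk descriptors, the hyperparameter keys) is our own choice, not something prescribed by the reference model.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Guideline:
    """Guideline of a chunk: intention, situation, strategy, and
    strategy hyperparameters (items i-iv of Section 3.2)."""
    intention: str                   # i) the goal of the chunk
    situation: Dict[str, List[str]]  # ii) input and output products
    strategy: str                    # iii) textual activity description,
                                     #      or composition of sub-chunks
    hyperparameters: Dict[str, Any] = field(default_factory=dict)  # iv)


@dataclass
class Chunk:
    """A process of the data pipeline; composable into lower-level chunks."""
    descriptor: str  # unique identifier
    level: str       # "pipeline", "process", or "task"
    guideline: Guideline
    sub_chunks: List["Chunk"] = field(default_factory=list)


# A process-level chunk with one task-level sub-chunk, mirroring the
# "query preparation and image crawling" example from the text.
data_collection = Chunk(
    descriptor="data-collection",
    level="process",
    guideline=Guideline(
        intention="gather candidate images for the target labels",
        situation={"inputs": ["label products"],
                   "outputs": ["image products"]},
        strategy="web-based search",
        hyperparameters={"search_engines": "multiple",
                         "query_expansion": True},
    ),
    sub_chunks=[
        Chunk(
            descriptor="query-preparation",
            level="task",
            guideline=Guideline(
                intention="derive search queries from the labels",
                situation={"inputs": ["label products"],
                           "outputs": ["queries"]},
                strategy="expand each label into multiple queries",
            ),
        ),
    ],
)
```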
3.3. Process maps

Each dataset is created in a pipeline that uses different chunks, in different orders, at different granularities. Composing chunks in various ways impacts the final dataset products and the outputs of a resulting model. We formalise the overall process, i.e. the connections between chunks, through intention-strategy maps and their sections, inspired by method engineering, which assembles chunks to compose individual situated methods. We develop these maps to materialize the process lens mentioned in Section 1, and to evaluate its potential and limitations.

Intuitively, an intention-strategy map represents a sequence of chunks. Formally, the intention-strategy map is an ordered sequence of intentions connected by the strategies to realize them (see Figure 1), each intention-strategy pair referring to a chunk guideline. Attached to the sequence are sections. Each section consists of the initial and the subsequent intention of one of the chunks, and the strategy to realize the subsequent intention, i.e. the description of how the chunk fulfills its intention. Additionally, the section contains an intention selection guideline specifying the reason to go from the first intention to the second, and a strategy selection guideline specifying the reason for choosing this particular strategy to perform this intention.

As an illustration, Figure 2 provides an example of a segment of the process map for the ImageNet creation process.

Figure 2. Example segments of the process map associated with the development of ImageNet [9].

We identified multiple chunks and process maps in our literature review, which shows that a plurality of sequences exists. This suggests that there is no standardized data pipeline in computer vision. The process-level chunks are not always the same, and their order might differ widely or may not be reported at all. The data-processing chunks, for example, happen either after collection, after annotation, or after filtering. The task-level chunks and their parameters may also vary between the training and deployment phases. While both may impact the outputs of the model, they are often not clearly specified in the papers.
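The intention-strategy map itself can then be sketched as an ordered list of sections. Again, this is our illustrative encoding (re-using the hypothetical classes from the sketches above), with content loosely paraphrasing the ImageNet segment of Figure 2.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Section:
    """One step of an intention-strategy map."""
    initial_intention: str
    subsequent_intention: str
    strategy: str                              # how the chunk fulfils it
    chunk_descriptor: str                      # chunk guideline referred to
    intention_selection: Optional[str] = None  # why this next intention
    strategy_selection: Optional[str] = None   # why this strategy


@dataclass
class IntentionStrategyMap:
    """Ordered sequence of intentions connected by strategies."""
    name: str
    sections: List[Section] = field(default_factory=list)


# Segment loosely mirroring the ImageNet development map of Figure 2.
imagenet_segment = IntentionStrategyMap(
    name="ImageNet development (segment)",
    sections=[
        Section(
            initial_intention="start",
            subsequent_intention="labels defined",
            strategy="ontology-based (all WordNet synsets)",
            chunk_descriptor="label-definition",
            intention_selection=None,  # not mentioned in the source paper
            strategy_selection="existing description of the world",
        ),
        Section(
            initial_intention="labels defined",
            subsequent_intention="images collected",
            strategy="web-based search (crawling)",
            chunk_descriptor="data-collection",
            strategy_selection="search engines as a rich source of images",
        ),
    ],
)
```

Note how the optional selection guidelines encode exactly the gap observed in the literature: when a paper does not report why a strategy was chosen, the corresponding field stays empty and the omission becomes visible.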

4. Discussion

Introducing processes via the formalism presents advantages for researchers and practitioners of computer vision.

4.1. Insights for researchers

Identification of knowledge gaps. Our formalism enables us to systematically analyze the typical processes employed in computer vision across tasks. This unifying lens uncovers the grey areas that escape scrutiny. In fact, we believe that only a fraction of existing process chunks have been analyzed with respect to ethical considerations and downstream negative consequences [24]. For example, our framework highlighted an understudied issue: the papers we reviewed rarely report on data-processing chunks—yet the interaction of these chunks with other pipeline chunks can potentially have a significant impact. For instance, cropping images after labeling might exclude the visual information relevant to the label, making the model learn wrong associations, a matter that seems not to have been studied previously. Our formalism allows us to surface this issue, as it makes apparent the related chunks and the current lack of reporting on them.

Process details. Applying our formalism also uncovers which chunks, metrics, and intentions researchers deem worth documenting in their publications. This can guide researchers towards identifying the chunks that merit more detailed descriptions. For example, while a few interdisciplinary publications mention issues with the label-related products, less than 10% of the reviewed papers were found to refer to some of the properties of these products; e.g. only the SUN database [32] and MS-COCO [20] outline their reasoning around the completeness of the set of labels employed. As for data collection and filtering, only the authors of Tiny Images [30] make transparent the potential biases in images retrieved from search engines, and include a study of the dataset noise with and without filtering these images. More rigorous documentation could support a more systematic analysis of potential issues.

4.2. Insights for practitioners

Supporting developers or auditors in making the processes surrounding data more transparent and structured with our formalism could help with capturing or fostering reflection around potential issues. Compared to previous frameworks [12], our formalism enables reporting on the entire pipeline surrounding the dataset, instead of solely the dataset or a subset of the processes executed to develop it. For instance, it outlines both the development and deployment pipelines, which makes it easier to identify potential differences between them in terms of chunks, chunk order, and chunk hyperparameters. These differences can lead to distribution shift, resulting in misclassifications, unfairness in outputs, or non-robustness, as the Detroit police example showed.

The body of documented chunks of an organization can grow over time. The identical format and reusable nature of chunks may help practitioners to easily identify the chunks of a new pipeline, and to build on existing documentation. It may also encourage them to share ethical considerations corresponding to chunks across pipelines.

4.3. Reflections and limitations

Despite these potential advantages, the feasibility of building a satisfying reference model for systematizing the analysis of data pipelines for harms requires further discussion. Most papers we studied, and our preliminary interviews with practitioners, have shown a divergence in data pipeline design, including a lack of common chunks or of a common order of processes. This has parallels in software engineering. Prior work in method engineering recognizes that all projects are different and cannot be supported by a single method; instead, adapted methodological guidance should be proposed [4]. However, the divergences in CV data pipelines pose challenges to fulfilling our requirements of representativity, comprehensiveness, and clarity. This raises a number of questions. For example: Can we develop a reference model that would capture all these messy or idiosyncratic processes? Would its use in production be tedious, and would it easily become out of date? Could modularity help in incorporating further processes and details? Would such a model give sufficient structure to the reporting of data pipelines for purposes of increased transparency and accountability?

Finally, our primary interest lies in exploring the use of such formalisms for systematizing the identification of harms that may stem from data pipelines and surrounding processes. Whether such use of a reference model would aid in more effective and efficient identification of, and action on, harms in data pipelines requires further study. Other fields such as information retrieval have developed a common language and benchmarks to study their systems of interest and to foster communication between industry, academia, and governments, enabling progress on their respective problems [27]. Whether this would also apply to computer vision and the analysis of associated harms is yet to be seen.

Our empirical analysis of data pipelines in this paper has illustrated some of the envisioned benefits of a process-oriented lens. The sole fact that the analysis has raised the diversity of questions above hints at the richness of this lens, which we plan to investigate in the future.

References

[1] Bobby Allyn. 'The computer got it wrong': How facial recognition led to false arrest of Black man. https://www.npr.org/2020/06/24/882683463/the-computer-got-it-wrong-how-facial-recognition-led-to-a-false-arrest-in-michig?t=1616003785339, 2020.
[2] Abeba Birhane and Vinay Uday Prabhu. Large image datasets: A pyrrhic win for computer vision? In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 1537–1547, 2021.
[3] Philipp Blandfort, Tushar Karayil, Jörn Hees, and Andreas Dengel. The focus–aspect–value model for predicting subjective visual attributes. International Journal of Multimedia Information Retrieval, 9(1):47–60, 2020.
[4] Sjaak Brinkkemper. Method engineering: engineering of information systems development methods and tools. Information and Software Technology, 38(4):275–280, 1996.
[5] Mark Buckler, Suren Jayasuriya, and Adrian Sampson. Reconfiguring the imaging pipeline for computer vision. In Proceedings of the IEEE International Conference on Computer Vision, pages 975–984, 2017.
[6] Zhenfang Chen, Peng Wang, Lin Ma, Kwan-Yee K. Wong, and Qi Wu. Cops-ref: A new dataset and task on compositional referring expression comprehension. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10086–10095, 2020.
[7] Kate Crawford and Vladan Joler. Anatomy of an AI system. Retrieved September 18, 2018.
[8] Kate Crawford and Trevor Paglen. Excavating AI: the politics of training sets for machine learning. Excavating AI (www.excavating.ai), 2019.
[9] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009.
[10] Emily Denton, Alex Hanna, Razvan Amironesei, Andrew Smart, Hilary Nicole, and Morgan Klaus Scheuerman. Bringing the people back in: Contesting benchmark machine learning datasets. arXiv preprint arXiv:2007.07399, 2020.
[11] Logan Engstrom, Brandon Tran, Dimitris Tsipras, Ludwig Schmidt, and Aleksander Madry. Exploring the landscape of spatial robustness. In International Conference on Machine Learning, pages 1802–1811. PMLR, 2019.
[12] Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III, and Kate Crawford. Datasheets for datasets. arXiv preprint arXiv:1803.09010, 2018.
[13] Mahdi Fahmideh Gholami, Pooyan Jamshidi, and F. Shams. A procedure for extracting software development process patterns. In UKSim European Symposium on Computer Modeling and Simulation, pages 75–83. IEEE, 2010.
[14] Brian Henderson-Sellers and Jolita Ralyté. Situational method engineering: state-of-the-art review. Journal of Universal Computer Science, 2010.
[15] Dan Hendrycks and Thomas Dietterich. Benchmarking neural network robustness to common corruptions and perturbations. In International Conference on Learning Representations, 2018.
[16] Ben Hutchinson, Andrew Smart, Alex Hanna, Emily Denton, Christina Greer, Oddur Kjartansson, Parker Barnes, and Margaret Mitchell. Towards accountability for machine learning datasets: Practices from software engineering and infrastructure. FAccT, 2021.
[17] Eun Seo Jo and Timnit Gebru. Lessons from archives: Strategies for collecting sociocultural data in machine learning. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, pages 306–316, 2020.
[18] Muhammad Haris Khan, John McDonagh, Salman Khan, Muhammad Shahabuddin, Aditya Arora, Fahad Shahbaz Khan, Ling Shao, and Georgios Tzimiropoulos. AnimalWeb: A large-scale hierarchical dataset of annotated animal faces. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6939–6948, 2020.
[19] Xuhong Li, Yves Grandvalet, Franck Davoine, Jingchun Cheng, Yin Cui, Hang Zhang, Serge Belongie, Yi-Hsuan Tsai, and Ming-Hsuan Yang. Transfer learning in computer vision tasks: Remember where you come from. Image and Vision Computing, 93:103853, 2020.
[20] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer, 2014.
[21] C. Matthew MacKenzie, Ken Laskey, Francis McCabe, Peter F. Brown, Rebekah Metz, and Booz Allen Hamilton. Reference model for service oriented architecture 1.0. OASIS Standard, 12(S 18), 2006.
[22] Nicolas Malevé. On the data set's ruins. AI & SOCIETY, pages 1–15, 2020.
[23] Milagros Miceli, Tianling Yang, Laurens Naudts, Martin Schuessler, Diana Serbanescu, and Alex Hanna. Documenting computer vision datasets: An invitation to reflexive data practices. In ACM Conference on Fairness, Accountability, and Transparency (ACM FAccT), 2021.
[24] Amandalynne Paullada, Inioluwa Deborah Raji, Emily M. Bender, E. Denton, and A. Hanna. Data and its (dis)contents: A survey of dataset development and use in machine learning research. arXiv preprint arXiv:2012.05345, 2020.
[25] Chuan-Xian Ren, Dao-Qing Dai, Ke-Kun Huang, and Zhao-Rong Lai. Transfer learning of structured representation for face recognition. IEEE Transactions on Image Processing, 23(12):5440–5454, 2014.
[26] Nithya Sambasivan, Shivani Kapania, Hannah Highfill, Diana Akrong, Praveen Kumar Paritosh, and Lora Mois Aroyo. "Everyone wants to do the model work, not the data work": Data cascades in high-stakes AI. 2021.
[27] TREC Conference series. Text retrieval conference - overview. https://trec.nist.gov/overview.html, 2019.
[28] Shreya Shankar, Yoni Halpern, E. Breck, J. Atwood, J. Wilson, and D. Sculley. No classification without representation: Assessing geodiversity issues in open data sets for the developing world. arXiv preprint arXiv:1711.08536, 2017.

[29] Ranjita Thapa, Cornell Tech, Serge Belongie, and Awais Khan. The FGVC plant pathology 2020 challenge dataset to classify foliar diseases of apples.
[30] Antonio Torralba, R. Fergus, and W. T. Freeman. 80 million tiny images: A large data set for nonparametric object and scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(11):1958–1970, 2008.
[31] Robert Winter and Joachim Schelp. Reference modeling and method construction: a design science perspective. In Proceedings of the 2006 ACM Symposium on Applied Computing, pages 1561–1562, 2006.
[32] Jianxiong Xiao, James Hays, Krista A. Ehinger, Aude Oliva, and Antonio Torralba. SUN database: Large-scale scene recognition from abbey to zoo. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 3485–3492. IEEE, 2010.
[33] Kaiyu Yang, Klint Qinami, Li Fei-Fei, Jia Deng, and Olga Russakovsky. Towards fairer datasets: Filtering and balancing the distribution of the people subtree in the ImageNet hierarchy. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, pages 547–558, 2020.
[34] Stephan Zheng, Yang Song, T. Leung, and I. Goodfellow. Improving the robustness of deep neural networks via stability training. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4480–4488, 2016.

Figure 3. Dataset creation pipeline from [3].

Figure 4. Dataset creation pipeline from a course on machine learning: https://developers.google.com/machine-learning/data-prep/process?hl=fr

A. The Art of Sketching Data Pipelines

It is difficult to even find sketches of data pipelines in publications (Figure 3). They are easier to find in the grey materials of companies, such as those of Amazon (Figure 6), Google (Figure 4), and Dynam.AI (Figure 5). We show examples of these sketches below.

The main observation one can make from these sketches is that they are very diverse, without any uniformity. They do not all use the same level of granularity to talk about the data pipelines, they do not all use the same vocabulary, and they refer to various different processes, often mentioning the training-data development process but not the deployment processes. This variability in the sketches does not allow one to reflect on the potential harms of the pipelines in a global way that can be re-used to talk about different pipelines.

Figure 5. Dataset creation pipeline from Dynam.AI (https://www.dynam.ai/computer-vision-projects-management-part-1/), a B2B company that implements computer vision solutions.

B. Details about the methodology

B.1. Systematic review of the literature

We systematically surveyed the computer science publications that present computer vision datasets, or that propose methods pertaining to the creation of datasets. For that, we went through the lists of publications of the following top conferences: CVPR, ECCV, ICCV, NeurIPS, ICLR, and associated workshops, in the general field of machine learning and in the field of computer vision more specifically (the assumption here is that in the more general conferences, some computer vision publications might appear, and some papers might mention general data pipelines with examples in computer vision), for the years 2015 to June 2020.

Figure 6. Dataset creation pipeline from Amazon: https://aws.amazon.com/blogs/iot/sagemaker-object-detection-greengrass-part-1-of-3/

Then we proceeded to snowball sampling from the references of the papers initially collected.

We only retained publications related to images (we excluded video-dataset papers and 3D-based works) that touch upon classification or segmentation tasks, as the data pipelines for other types of data and tasks vary even more than for these image tasks. This amounted to around 270 papers. We filtered out other types of tasks, such as referring expression comprehension [6], image captioning, and visual question answering, because the data pipelines of many of these datasets are not described in detail in the papers, and are not always published in the field of computer vision but also in natural language processing. The set of applications of this pool of publications (around 220 papers) is large. We further scoped the study by selecting datasets and research publications that have as their object of study objects (things), scenes and stuff, or persons (mainly faces and emotions), as these are among the main current interests in computer vision research. We filtered out datasets for autonomous driving and aerial views, as they consist mostly of road and field views, which is a more restricted set of classification labels; hence they present mostly a subset of the processes retrieved from the publications we selected (and there are too many of them), and a subset of the potential harms. Although works relying on these types of data and/or applications are not included, and we assume that many points we learn from our survey will apply to them, it would surely be beneficial to repeat our analysis on these datasets in the future, as even more harms might arise from them.

In the end, we selected 51 dataset papers to study thoroughly. We also analysed shorter datasets of natural topics such as animals (e.g. [18]) and flora [29], as we noticed that although they serve the same type of classification tasks, their collection process differs from the other works.

Concerning computer vision papers that relate to processes of the pipelines, we solely found papers that study and propose ways to optimize the crowdsourcing components of the pipelines, amounting to 16. Only one paper [5] seems to study the impact of the image signal processing pipeline on the accuracy and energy consumption of the subsequent learning pipeline.

B.2. Interviews

As for the interviews, we plan to interview two groups of individuals: researchers working in the field of computer vision, and practitioners who develop computer vision pipelines. In both cases, our goal is to identify limitations of the initial reference model and of the population of it that we drew from the literature review. For that, we first ask the participants to talk about the pipelines they know of, and to describe them (this way, we can also identify the processes they deem worth reporting). We then present to the interview participants a drawing of the reference model, and we ask them to discuss it critically, i.e. to identify chunks that could be missing or hyperparameters that could be different, as well as to re-order the chunks based on their experience. This way, we can iteratively improve our reference model and validate it. Finally, we interrogate them about harms they know of in relation to the different processes.

C. Reference model: further details

• Label definition. This chunk consists in defining the label-related products. The method employed to define these objects (which maps to various sub-chunks depending on the dataset), their source, and the individuals who define them are the main hyperparameters.

• Data collection. This chunk consists in collecting a set of potential image-related products for the dataset. A multitude of hyperparameters lie within this activity, stemming again from the source of the images, the individuals collecting them, and the general approach used to gather them.
• Data annotation. This chunk consists in collecting the set of image-annotation tuples of the dataset. It is not needed when images are photographed by the researchers for pre-defined labels, but it is when they are collected from the Internet. The chunk strategy is instantiated in different task-chunks, with an automatic process (pre-trained machine learning models, or metadata of the images), a human process, or a hybrid one. The human-process chunks can be further divided into sub-chunks.
• Data filtering. This chunk consists in fine-tuning the set of image products by removing images that do not fit the requirements fixed for the dataset. It impacts the properties of the image-related and possibly the annotation-related products.
• Data processing. This chunk consists in transforming the individual images in the image-related products according to certain processing criteria, so that the images verify certain properties, usually in terms of size (downsampling or upsampling), and possibly color hue, lighting, etc. This is done to support their use within machine learning classifiers, and in some cases to simplify the learning tasks.
• Data augmentation. Although not an activity mentioned in dataset papers, this is a standard practice in industry; it re-uses task-chunks of data processing to modify the image-related products (especially the size of the set).
• Data splitting. The dataset is divided into training(, validation) and test sets, which creates new versions of all the dataset products.
• Product refinement. Although not considered in any computer vision paper, there is an additional set of chunks related to dataset transformations, generally happening after deployment. Datasets might be reduced in size, augmented with new labels and images, or deprived of some, as ethical or performance issues are identified or distribution shifts happen. This impacts the properties of each dataset product. Surfacing these chunks might provide insights to develop best practices or critical views on the handling of harms.
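To illustrate how these process-level chunks compose into pipelines, and how the divergences discussed in Section 3.3 could be surfaced mechanically, here is a toy Python comparison of two chunk orderings of the kind observed in our review. The orderings and the helper function are hypothetical, for illustration only.

```python
from typing import Dict, List

# Two plausible development pipelines: the data-processing chunk can
# occur at different points (cf. Section 3.3), which is rarely reported.
PIPELINE_A: List[str] = ["label-definition", "data-collection",
                         "data-processing", "data-annotation",
                         "data-filtering", "data-splitting"]
PIPELINE_B: List[str] = ["label-definition", "data-collection",
                         "data-annotation", "data-filtering",
                         "data-processing", "data-augmentation",
                         "data-splitting"]


def diff_pipelines(a: List[str], b: List[str]) -> Dict[str, List[str]]:
    """Surface differences in chunk presence and chunk order between
    two pipelines, e.g. a development and a deployment pipeline."""
    return {
        "only_in_a": [c for c in a if c not in b],
        "only_in_b": [c for c in b if c not in a],
        "order_in_a": [c for c in a if c in b],
        "order_in_b": [c for c in b if c in a],
    }


if __name__ == "__main__":
    print(diff_pipelines(PIPELINE_A, PIPELINE_B))
    # only_in_b reports 'data-augmentation'; the order_in_* entries show
    # that the shared chunks appear in different sequences.
```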
