Exploring Data Pipelines through the Process Lens: a Reference Model for Computer Vision

Agathe Balayn (TU Delft, [email protected])   Bogdan Kulynych (EPFL, [email protected])   Seda Gürses (TU Delft, [email protected])

arXiv:2107.01824v1 [cs.CV] 5 Jul 2021

Abstract

Researchers have identified datasets used for training computer vision (CV) models as an important source of hazardous outcomes, and continue to examine popular CV datasets to expose their harms. These works tend to treat datasets as objects, or focus on particular steps in data production pipelines. We argue here that we could further systematize our analysis of harms by examining CV data pipelines through a process-oriented lens that captures the creation, the evolution, and the use of these datasets. As a step towards cultivating a process-oriented lens, we embarked on an empirical study of CV data pipelines informed by the field of method engineering. We present here a preliminary result: a reference model of CV data pipelines. Besides exploring the questions that this endeavor raises, we discuss how the process lens could support researchers in discovering understudied issues, and could help practitioners in making their processes more transparent.

1. Introduction

Training data can have a considerable impact on the outputs of a machine learning system, by affecting its accuracy, or by causing undesirable or even harmful effects when the system is deployed in applications that act in the world. In response to these harms, machine-learning and interdisciplinary researchers who tackle the societal impact of computer vision typically a) explore characteristics of datasets with respect to their potentially harmful impacts on society [8, 28, 22], e.g., taxonomies with offensive labels, often resulting from uncritical practices [2]; b) identify and mitigate harms by adapting single, discrete processes in the data pipeline, e.g. filtering out the offensive and non-visual labels of the taxonomy [33, 2, 17]; c) propose techniques to make datasets more transparent [12], to encourage analysts to be more accountable and reflective of potential harms. These forms of enquiry presuppose considering datasets as objects with configurable and conceivable properties.

Recent exceptions [23, 26, 7, 10, 16] hint at another direction, one that explores not only the datasets themselves but considers more broadly the data pipelines and the actors behind their creation and use. In this paper, we take inspiration from these works and explore a process-oriented lens on data pipelines. Such a lens may encourage researchers and practitioners to be more rigorous in the production and analysis of datasets vis-à-vis their potential societal impact. Most importantly, a process lens enables us to recognize and capture the following:

Datasets as complex, living objects. Datasets are not inert but living objects. They are regularly updated in production to improve models, and they are often re-purposed and adapted for new applications. Moreover, datasets interact with each other, particularly through models: e.g., they are composed using transfer learning [19], such as for improving facial recognition systems meant to be applied in a wide variety of contexts [25]. However, while datasets can create harms at each phase of their life, academic works on datasets are typically ambiguous about these phases, and seem to be based only on the datasets' first release.

The pipeline as a vector of harms. Not only can the processes that follow a dataset's first release raise issues, but so can the interplay between these processes and the ones leading to the first release, as it defines the potential errors of a model. For instance, in Detroit, following harmful mistakes of a facial recognition system in deployment, the police decided to apply it only to still images, as these are closer to the training data collected in a static setting during development; this shows how some harms can occur due to problematic mismatches between processes in the deployment and development phases [1]. Some existing works have already investigated the impact of pipeline processes such as rotation, cropping, and other pre-processing on models' accuracy [11, 15, 34], although not necessarily on societal harms [8]. Nonetheless, little attention has been given to studying the impact of sequences of processes.

Reasoning behind data pipeline design. The lack of transparency about the design of a dataset's pipeline (see examples in Appendix A) makes it challenging for researchers to analyse the procedural harms, or the downstream harms of the models trained on a dataset.

Such transparency would reveal not only the processes themselves, but also the reasoning behind them. For instance, while it is valuable to study datasets in depth, it can be intractable to analyse each label and image due to the scale. Documentation of the reasoning behind the choices of label ontologies, and of the process of collecting the corresponding images, would improve the analysis. E.g., knowing that WordNet was used to select the labels of ImageNet [9] enables us to directly reflect on harms from the point of view of the taxonomy [8], and knowing that images were collected from country-specific search engines enables us to foresee cultural skews in label representations [28]. Such matters are not consistently included in academic works, and they are not always explicitly asked for in documentation methodologies [12].

Our contributions. In this paper, we devise a systematic approach to study data pipelines in computer vision. For this, we combine techniques from the development of reference models, method engineering, and process engineering. These are well-established, intertwined fields [31] that propose methods to capture complex sets of processes. Using these techniques, we develop an empirically informed reference model of data pipelines in computer vision. We then reflect on its potential usefulness, as well as on the possible limitations of this approach for creating transparency and accountability for practitioners and researchers.

2. Methodology

Our objective is to build a reference model of the data pipeline in machine learning-based computer vision tasks. The literature on method, process, and reference-model engineering identifies the following requirements for reference models: representativity of the pipelines used in practice, at an abstraction level high enough to comprehensively include each of their processes [21, 13, 4]; modularity, to easily add new processes [14, 4]; clarity, for individuals interested in different parts of the pipeline [21, 4]; and a sufficient level of detail to enable actionable reporting and reflection on potential ethical issues.

Following these requirements, we build: a high-level reference model that serves as an ontology of our main concepts; a set of low-level models of each identifiable process within existing pipelines; and a set of process maps reflecting the pipelines of the main computer vision practices.

Populating the models and maps. The methods from each of the mentioned fields have advantages and limitations. Process patterns in process engineering allow for a fine-grained description of the processes, but generally impose an order on these processes, which does not correspond to the computer vision pipelines we analysed. Reference models do not necessarily impose an order, but can be too generic for our desired level of detail. However, method engineering formalisms can alleviate these issues, as they support processes of different granularities, and the interactions between the processes can be represented in a map without necessarily enforcing an order.

Thus, we employed a mix of methods from the different disciplines. The details can be found in Appendix B. In short: 1) We did a systematic literature review of publications that present computer vision datasets (51 publications) or specific processes to build datasets (16 publications), selected out of 220 relevant papers found using different search methods. 2) We identified and noted the main processes within each publication. 3) By comparing the descriptions, we iteratively identified common processes, refined their granularity, and aggregated them into meaningful method processes. 4) We described each of them with a relevant formalism, and drew the maps of each dataset pipeline. 5) We searched for grey areas, and are in the process of conducting expert interviews to further validate and refine our reference model, and to model processes that are not objects of scientific publications (e.g. processes for serving data in deployment).

3. Results: Reference model of data processes

In this section, we present our formalism for the components of the reference model, summarized in Figure 1, and provide examples of its application.

3.1. Products

The products are the essential components of the datasets. Each product is defined by an object (e.g. an image, a set of images, an image-annotation pair, etc.) and a set of properties. Based on the literature review, we have identified the following types of products:
Label-related products. A dataset comprises a set of target labels, often organized into a hierarchical taxonomy, with higher levels of the taxonomy representing more abstract concepts (e.g. animal) and lower levels representing more fine-grained concepts (e.g. types of animals). The main properties considered in the reviewed literature are the size (number of labels) and the number of levels in the hierarchy. Many properties of labels are currently implicit, whereas other properties, like where the taxonomy originates from, can be relevant for capturing potential harms.

Image-related products. A dataset typically contains a set of images. The size of the set, the content of the images, and especially its diversity are put forward as the main properties in the reviewed literature. Additional properties can be considered, especially the metadata attached to the images (e.g. the author of the images, etc.), physical characteristics of the images such as their quality, resolution, etc., and the distribution of these characteristics across the set.

Annotation-related products. These products can be thought of as a list of tuples, each composed of an image and one or multiple annotations. The properties put forward are often the number and quality of annotations per image.
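To make the product formalism concrete, the following is a minimal Python sketch of how the three product types could be encoded. This is our illustration rather than an artifact of the reference model itself; all class and field names are assumptions chosen for readability.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional, Tuple


@dataclass
class Product:
    """A dataset product: an object together with a set of properties."""
    name: str
    properties: Dict[str, Any] = field(default_factory=dict)


@dataclass
class LabelProduct(Product):
    """Label-related product: target labels, possibly a taxonomy."""
    labels: List[str] = field(default_factory=list)
    hierarchy_levels: int = 1              # depth of the taxonomy
    taxonomy_source: Optional[str] = None  # often left implicit in papers


@dataclass
class ImageProduct(Product):
    """Image-related product: a set of images plus their metadata."""
    image_uris: List[str] = field(default_factory=list)


@dataclass
class AnnotationProduct(Product):
    """Annotation-related product: (image, annotations) tuples."""
    pairs: List[Tuple[str, List[str]]] = field(default_factory=list)


# A toy label product that makes the usually-implicit properties
# explicit, e.g. where the taxonomy originates from.
toy_labels = LabelProduct(
    name="toy-taxonomy",
    properties={"size": 3},
    labels=["animal", "dog", "cat"],
    hierarchy_levels=2,
    taxonomy_source="WordNet",
)
```

Making fields such as `taxonomy_source` mandatory parts of a product record is one way to force the normally implicit properties into view.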

Figure 1. UML diagram describing the main components of our reference model formalism, and activity diagram for the process map. Relationships: association, aggregation (i.e. "part-of"), composition, inheritance (i.e. "is-a").

3.2. Chunks

Borrowing a standard term from method engineering, we refer to the processes of the data pipeline as chunks. Chunks can be composed: a chunk can consist of multiple lower-level chunks. We define three major granularity levels. 1) The pipeline level—the main sections of the data pipeline: the development of training data, the processes of improving and updating the training data in deployment (feedback loops), and the processing of data at inference time. 2) The process level—the individual activities taking place in the chunks of the pipeline level (e.g., data collection). 3) The task level—the sub-activities required to conduct the activities of the prior level (e.g. query preparation and image crawling for data collection). These chunks can be further divided into sub-tasks when necessary.

Each chunk has an associated guideline that defines the activity it represents in relation to the products, and a descriptor that uniquely identifies it. A guideline is composed of: i) an intention description, i.e. the goal of the chunk; ii) a situation description, i.e. the input and output products of the chunk; iii) a strategy description that describes the activity textually (i.e. at the finest granularity), or indicates the composition of a sequence of lower-level chunks performing the activity; iv) a set of strategy hyperparameters allowing one to be precise about the design choices in the strategy.

Based on the literature review, we have identified the following process-level chunks: label definition, data collection, data annotation, data filtering, data processing, data augmentation, data splitting, and product refinement. We provide a short description of these, and of some task-level chunks that we have identified, in Appendix C.
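As a reading aid, here is one possible encoding of chunks and their guidelines, continuing the illustrative sketch above. Every identifier in it (the chunk descriptors, the hyperparameter keys) is our own choice, not something prescribed by the reference model.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Guideline:
    """Guideline of a chunk: intention, situation, strategy, and
    strategy hyperparameters (items i-iv of Section 3.2)."""
    intention: str                   # i) the goal of the chunk
    situation: Dict[str, List[str]]  # ii) input and output products
    strategy: str                    # iii) textual activity description,
                                     #      or composition of sub-chunks
    hyperparameters: Dict[str, Any] = field(default_factory=dict)  # iv)


@dataclass
class Chunk:
    """A process of the data pipeline; composable into lower-level chunks."""
    descriptor: str  # unique identifier
    level: str       # "pipeline", "process", or "task"
    guideline: Guideline
    sub_chunks: List["Chunk"] = field(default_factory=list)


# A process-level chunk with one task-level sub-chunk, mirroring the
# "query preparation and image crawling" example from the text.
data_collection = Chunk(
    descriptor="data-collection",
    level="process",
    guideline=Guideline(
        intention="gather candidate images for the target labels",
        situation={"inputs": ["label products"],
                   "outputs": ["image products"]},
        strategy="web-based search",
        hyperparameters={"search_engines": "multiple",
                         "query_expansion": True},
    ),
    sub_chunks=[
        Chunk(
            descriptor="query-preparation",
            level="task",
            guideline=Guideline(
                intention="derive search queries from the labels",
                situation={"inputs": ["label products"],
                           "outputs": ["queries"]},
                strategy="expand each label into multiple queries",
            ),
        ),
    ],
)
```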
3.3. Process maps

Each dataset is created in a pipeline that uses different chunks, in different orders, at different granularities. Composing chunks in various ways impacts the final dataset products and the outputs of a resulting model. We formalise the overall process, i.e. the connections between chunks, through intention-strategy maps and their sections, inspired by method engineering, which assembles chunks to compose individual situated methods. We develop these maps to materialize the process lens mentioned in Section 1, and to evaluate its potential and limitations.

Intuitively, an intention-strategy map represents a sequence of chunks. Formally, the intention-strategy map is an ordered sequence of intentions connected by the strategies to realize them (see Figure 1), each intention-strategy pair referring to a chunk guideline. Attached to the sequence are sections. Each section consists of the initial and the subsequent intention of one of the chunks, and the strategy to realize the subsequent intention, i.e. the description of how the chunk fulfills its intention. Additionally, the section contains an intention selection guideline specifying the reason to go from the first intention to the second, and a strategy selection guideline specifying the reason for choosing this particular strategy to perform this intention.

As an illustration, Figure 2 provides an example of a segment of the process map for the ImageNet creation process.

Figure 2. Example segments of the process map associated with the development of ImageNet [9].

We identified multiple chunks and process maps in our literature review, which shows that a plurality of sequences exists. This suggests that there is no standardized data pipeline in computer vision. The process-level chunks are not always the same, and their order might differ widely or may not be reported at all. The data-processing chunks, for example, happen either after collection, after annotation, or after filtering. The task-level chunks and their parameters may also vary between the training and deployment phases. While both may impact the outputs of the model, they are often not clearly specified in the papers.
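The intention-strategy map itself can then be sketched as an ordered list of sections. Again, this is our illustrative encoding (re-using the hypothetical classes from the sketches above), with content loosely paraphrasing the ImageNet segment of Figure 2.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Section:
    """One step of an intention-strategy map."""
    initial_intention: str
    subsequent_intention: str
    strategy: str                              # how the chunk fulfils it
    chunk_descriptor: str                      # chunk guideline referred to
    intention_selection: Optional[str] = None  # why this next intention
    strategy_selection: Optional[str] = None   # why this strategy


@dataclass
class IntentionStrategyMap:
    """Ordered sequence of intentions connected by strategies."""
    name: str
    sections: List[Section] = field(default_factory=list)


# Segment loosely mirroring the ImageNet development map of Figure 2.
imagenet_segment = IntentionStrategyMap(
    name="ImageNet development (segment)",
    sections=[
        Section(
            initial_intention="start",
            subsequent_intention="labels defined",
            strategy="ontology-based (all WordNet synsets)",
            chunk_descriptor="label-definition",
            intention_selection=None,  # not mentioned in the source paper
            strategy_selection="existing description of the world",
        ),
        Section(
            initial_intention="labels defined",
            subsequent_intention="images collected",
            strategy="web-based search (crawling)",
            chunk_descriptor="data-collection",
            strategy_selection="search engines as a rich source of images",
        ),
    ],
)
```

Note how the optional selection guidelines encode exactly the gap observed in the literature: when a paper does not report why a strategy was chosen, the corresponding field stays empty and the omission becomes visible.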

4. Discussion

Introducing processes via the formalism presents advantages for researchers and practitioners of computer vision.

4.1. Insights for researchers

Identification of knowledge gaps. Our formalism enables us to systematically analyze the typical processes employed in computer vision across tasks. This unifying lens uncovers the grey areas that escape scrutiny. In fact, we believe that only a fraction of existing process chunks have been analyzed with respect to ethical considerations and downstream negative consequences [24]. For example, our framework highlighted an understudied issue: the papers we reviewed rarely report on data-processing chunks—yet the interaction of these chunks with other pipeline chunks can potentially have a significant impact. For instance, cropping images after labeling might exclude the visual information relevant to the label, making the model learn wrong associations, a matter that seems not to have been studied previously. Our formalism allows us to surface this issue, as it makes apparent the related chunks and the current lack of reporting on them.

Process details. Applying our formalism also uncovers which chunks, metrics, and intentions researchers deem worth documenting in their publications. This can guide researchers towards identifying the chunks that merit more detailed descriptions. For example, while a few interdisciplinary publications mention issues with the label-related products, less than 10% of the reviewed papers were found to refer to some of the properties of these products; e.g. only the SUN database [32] and MS-COCO [20] outline their reasoning around the completeness of the set of labels employed. As for data collection and filtering, only the authors of Tiny Images [30] make transparent the potential biases in images retrieved from search engines, and include a study of the dataset noise with and without filtering these images. More rigorous documentation could support a more systematic analysis of potential issues.

4.2. Insights for practitioners

Supporting developers or auditors in making the processes surrounding data more transparent and structured with our formalism could help with capturing or fostering reflection around potential issues. Compared to previous frameworks [12], our formalism enables reporting on the entire pipeline surrounding the dataset, instead of solely the dataset or a subset of the processes executed to develop it. For instance, it outlines both the development and deployment pipelines, which makes it easier to identify potential differences between them in terms of chunks, chunk order, and chunk hyperparameters. These differences can lead to distribution shift, resulting in misclassifications, unfairness in outputs, or non-robustness, as the Detroit police example showed.

The body of documented chunks of an organization can grow over time. The identical format and reusable nature of chunks may help practitioners to easily identify the chunks of a new pipeline, and to build on existing documentation. It may also encourage them to share ethical considerations corresponding to chunks across pipelines.

4.3. Reflections and limitations

Despite these potential advantages, the feasibility of building a satisfying reference model for systematizing the analysis of data pipelines for harms requires further discussion. Most papers we studied, and our preliminary interviews with practitioners, have shown a divergence in data pipeline design, including a lack of common chunks or of a common order of processes. This has parallels in software engineering. Prior work in method engineering recognizes that all projects are different and cannot be supported by a single method; instead, adapted methodological guidance should be proposed [4]. However, the divergences in CV data pipelines pose challenges to fulfilling our requirements of representativity, comprehensiveness, and clarity. This raises a number of questions. For example: Can we develop a reference model that would capture all these messy or idiosyncratic processes? Would its use in production be tedious, and would it easily become out of date? Could modularity help in incorporating further processes and details? Would such a model give sufficient structure to the reporting of data pipelines for purposes of increased transparency and accountability?

Finally, our primary interest lies in exploring the use of such formalisms for systematizing the identification of harms that may stem from data pipelines and surrounding processes. Whether such use of a reference model would aid in more effective and efficient identification of, and action on, harms in data pipelines requires further study. Other fields such as information retrieval have developed a common language and benchmarks to study their systems of interest and to foster communication between industry, academia, and governments, enabling progress on their respective problems [27]. Whether this would also apply to computer vision and the analysis of associated harms is yet to be seen.

Our empirical analysis of data pipelines in this paper has illustrated some of the envisioned benefits of a process-oriented lens. The sole fact that the analysis has raised the diversity of questions above hints at the richness of this lens, which we plan to investigate in the future.

References

[1] Bobby Allyn. 'The computer got it wrong': How facial recognition led to false arrest of Black man. https://www.npr.org/2020/06/24/882683463/the-computer-got-it-wrong-how-facial-recognition-led-to-a-false-arrest-in-michig?t=1616003785339, 2020.
[2] Abeba Birhane and Vinay Uday Prabhu. Large image datasets: A pyrrhic win for computer vision? In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 1537–1547, 2021.
[3] Philipp Blandfort, Tushar Karayil, Jörn Hees, and Andreas Dengel. The focus–aspect–value model for predicting subjective visual attributes. International Journal of Multimedia Information Retrieval, 9(1):47–60, 2020.
[4] Sjaak Brinkkemper. Method engineering: engineering of information systems development methods and tools. Information and Software Technology, 38(4):275–280, 1996.
[5] Mark Buckler, Suren Jayasuriya, and Adrian Sampson. Reconfiguring the imaging pipeline for computer vision. In Proceedings of the IEEE International Conference on Computer Vision, pages 975–984, 2017.
[6] Zhenfang Chen, Peng Wang, Lin Ma, Kwan-Yee K. Wong, and Qi Wu. Cops-ref: A new dataset and task on compositional referring expression comprehension. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10086–10095, 2020.
[7] Kate Crawford and Vladan Joler. Anatomy of an AI system. Retrieved September 18, 2018.
[8] Kate Crawford and Trevor Paglen. Excavating AI: the politics of training sets for machine learning. Excavating AI (www.excavating.ai), 2019.
[9] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009.
[10] Emily Denton, Alex Hanna, Razvan Amironesei, Andrew Smart, Hilary Nicole, and Morgan Klaus Scheuerman. Bringing the people back in: Contesting benchmark machine learning datasets. arXiv preprint arXiv:2007.07399, 2020.
[11] Logan Engstrom, Brandon Tran, Dimitris Tsipras, Ludwig Schmidt, and Aleksander Madry. Exploring the landscape of spatial robustness. In International Conference on Machine Learning, pages 1802–1811. PMLR, 2019.
[12] Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III, and Kate Crawford. Datasheets for datasets. arXiv preprint arXiv:1803.09010, 2018.
[13] Mahdi Fahmideh Gholami, Pooyan Jamshidi, and F. Shams. A procedure for extracting software development process patterns. In UKSim European Symposium on Computer Modeling and Simulation, pages 75–83. IEEE, 2010.
[14] Brian Henderson-Sellers and Jolita Ralyté. Situational method engineering: state-of-the-art review. Journal of Universal Computer Science, 2010.
[15] Dan Hendrycks and Thomas Dietterich. Benchmarking neural network robustness to common corruptions and perturbations. In International Conference on Learning Representations, 2018.
[16] Ben Hutchinson, Andrew Smart, Alex Hanna, Emily Denton, Christina Greer, Oddur Kjartansson, Parker Barnes, and Margaret Mitchell. Towards accountability for machine learning datasets: Practices from software engineering and infrastructure. FAccT, 2021.
[17] Eun Seo Jo and Timnit Gebru. Lessons from archives: Strategies for collecting sociocultural data in machine learning. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, pages 306–316, 2020.
[18] Muhammad Haris Khan, John McDonagh, Salman Khan, Muhammad Shahabuddin, Aditya Arora, Fahad Shahbaz Khan, Ling Shao, and Georgios Tzimiropoulos. AnimalWeb: A large-scale hierarchical dataset of annotated animal faces. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6939–6948, 2020.
[19] Xuhong Li, Yves Grandvalet, Franck Davoine, Jingchun Cheng, Yin Cui, Hang Zhang, Serge Belongie, Yi-Hsuan Tsai, and Ming-Hsuan Yang. Transfer learning in computer vision tasks: Remember where you come from. Image and Vision Computing, 93:103853, 2020.
[20] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer, 2014.
[21] C. Matthew MacKenzie, Ken Laskey, Francis McCabe, Peter F. Brown, Rebekah Metz, and Booz Allen Hamilton. Reference model for service oriented architecture 1.0. OASIS Standard, 12(S 18), 2006.
[22] Nicolas Malevé. On the data set's ruins. AI & SOCIETY, pages 1–15, 2020.
[23] Milagros Miceli, Tianling Yang, Laurens Naudts, Martin Schuessler, Diana Serbanescu, and Alex Hanna. Documenting computer vision datasets: An invitation to reflexive data practices. In ACM Conference on Fairness, Accountability, and Transparency (ACM FAccT), 2021.
[24] Amandalynne Paullada, Inioluwa Deborah Raji, Emily M. Bender, E. Denton, and A. Hanna. Data and its (dis)contents: A survey of dataset development and use in machine learning research. arXiv preprint arXiv:2012.05345, 2020.
[25] Chuan-Xian Ren, Dao-Qing Dai, Ke-Kun Huang, and Zhao-Rong Lai. Transfer learning of structured representation for face recognition. IEEE Transactions on Image Processing, 23(12):5440–5454, 2014.
[26] Nithya Sambasivan, Shivani Kapania, Hannah Highfill, Diana Akrong, Praveen Kumar Paritosh, and Lora Mois Aroyo. "Everyone wants to do the model work, not the data work": Data cascades in high-stakes AI. 2021.
[27] TREC Conference series. Text retrieval conference - overview. https://trec.nist.gov/overview.html, 2019.
[28] Shreya Shankar, Yoni Halpern, E. Breck, J. Atwood, J. Wilson, and D. Sculley. No classification without representation: Assessing geodiversity issues in open data sets for the developing world. arXiv preprint arXiv:1711.08536, 2017.

[29] Ranjita Thapa, Cornell Tech, Serge Belongie, and Awais Khan. The FGVC plant pathology 2020 challenge dataset to classify foliar diseases of apples.
[30] Antonio Torralba, R. Fergus, and W. T. Freeman. 80 million tiny images: A large data set for nonparametric object and scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(11):1958–1970, 2008.
[31] Robert Winter and Joachim Schelp. Reference modeling and method construction: a design science perspective. In Proceedings of the 2006 ACM Symposium on Applied Computing, pages 1561–1562, 2006.
[32] Jianxiong Xiao, James Hays, Krista A. Ehinger, Aude Oliva, and Antonio Torralba. SUN database: Large-scale scene recognition from abbey to zoo. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 3485–3492. IEEE, 2010.
[33] Kaiyu Yang, Klint Qinami, Li Fei-Fei, Jia Deng, and Olga Russakovsky. Towards fairer datasets: Filtering and balancing the distribution of the people subtree in the ImageNet hierarchy. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, pages 547–558, 2020.
[34] Stephan Zheng, Yang Song, T. Leung, and I. Goodfellow. Improving the robustness of deep neural networks via stability training. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4480–4488, 2016.

Figure 3. Dataset creation pipeline from [3].

Figure 4. Dataset creation pipeline from a course on machine learning: https://developers.google.com/machine-learning/data-prep/process?hl=fr

A. The Art of Sketching Data Pipelines

It is difficult to even find sketches of data pipelines in publications (Figure 3). They are easier to find in the grey materials of companies, such as those of Amazon (Figure 6), Google (Figure 4), and Dynam.AI (Figure 5). We show examples of these sketches below.

The main observation one can make from these sketches is that they are very diverse, without any uniformity. They do not all use the same level of granularity to talk about the data pipelines, they do not all use the same vocabulary, and they refer to various different processes, often mentioning the training-data development process but not the deployment processes. This variability in the sketches does not allow one to reflect on the potential harms of the pipelines in a global way that can be re-used to talk about different pipelines.

Figure 5. Dataset creation pipeline from Dynam.AI (https://www.dynam.ai/computer-vision-projects-management-part-1/), a B2B company that implements computer vision solutions.

B. Details about the methodology

B.1. Systematic review of the literature

We systematically surveyed the computer science publications that present computer vision datasets, or that propose methods pertaining to the creation of datasets. For that, we went through the lists of publications of the following top conferences: CVPR, ECCV, ICCV, NeurIPS, ICLR, and associated workshops, in the general field of machine learning and in the field of computer vision more specifically (the assumption here is that in the more general conferences, some computer vision publications might appear, and some papers might mention general data pipelines with examples in computer vision), for the years 2015 to June 2020.

Figure 6. Dataset creation pipeline from Amazon: https://aws.amazon.com/blogs/iot/sagemaker-object-detection-greengrass-part-1-of-3/

Then we proceeded to snowball sampling from the references of the papers initially collected.

We only retained publications related to images (we excluded video-dataset papers and 3D-based works) that touch upon classification or segmentation tasks, as the data pipelines for other types of data and tasks vary even more than for these image tasks. This amounted to around 270 papers. We filtered out other types of tasks, such as referring expression comprehension [6], image captioning, and visual question answering, because the data pipelines of many of these datasets are not described in detail in the papers, and are not always published in the field of computer vision but also in natural language processing. The set of applications of this pool of publications (around 220 papers) is large. We further scoped the study by selecting datasets and research publications that have as their object of study objects (things), scenes and stuff, or persons (mainly faces and emotions), as these are among the main current interests in computer vision research. We filtered out datasets for autonomous driving and aerial views, as they consist mostly of road and field views, which is a more restricted set of classification labels; hence they present mostly a subset of the processes retrieved from the publications we selected (and there are too many of them), and a subset of the potential harms. Although works relying on these types of data and/or applications are not included, and we assume that many points we learn from our survey will apply to them, it would surely be beneficial to repeat our analysis on these datasets in the future, as even more harms might arise from them.

In the end, we selected 51 dataset papers to study thoroughly. We also analysed shorter datasets of natural topics such as animals (e.g. [18]) and flora [29], as we noticed that although they serve the same type of classification tasks, their collection process differs from the other works.

Concerning computer vision papers that relate to processes of the pipelines, we solely found papers that study and propose ways to optimize the crowdsourcing components of the pipelines, amounting to 16. Only one paper [5] seems to study the impact of the image signal processing pipeline on the accuracy and energy consumption of the subsequent learning pipeline.

B.2. Interviews

As for the interviews, we plan to interview two groups of individuals: researchers working in the field of computer vision, and practitioners who develop computer vision pipelines. In both cases, our goal is to identify limitations of the initial reference model and of the population of it that we drew from the literature review. For that, we first ask the participants to talk about the pipelines they know of, and to describe them (this way, we can also identify the processes they deem worth reporting). We then present to the interview participants a drawing of the reference model, and we ask them to discuss it critically, i.e. to identify chunks that could be missing or hyperparameters that could be different, as well as to re-order the chunks based on their experience. This way, we can iteratively improve our reference model and validate it. Finally, we interrogate them about harms they know of in relation to the different processes.

C. Reference model: further details

• Label definition. This chunk consists in defining the label-related products. The method employed to define these objects (which maps to various sub-chunks depending on the dataset), their source, and the individuals who define them are the main hyperparameters.

• Data collection. This chunk consists in collecting a set of potential image-related products for the dataset. A multitude of hyperparameters lie within this activity, stemming again from the source of the images, the individuals collecting them, and the general approach used to gather them.
• Data annotation. This chunk consists in collecting the set of image-annotation tuples of the dataset. It is not needed when images are photographed by the researchers for pre-defined labels, but it is when they are collected from the Internet. The chunk strategy is instantiated in different task-chunks, with an automatic process (pre-trained machine learning models, or metadata of the images), a human process, or a hybrid one. The human-process chunks can be further divided into sub-chunks.
• Data filtering. This chunk consists in fine-tuning the set of image products by removing images that do not fit the requirements fixed for the dataset. It impacts the properties of the image-related and possibly the annotation-related products.
• Data processing. This chunk consists in transforming the individual images in the image-related products according to certain processing criteria, so that the images verify certain properties, usually in terms of size (downsampling or upsampling), and possibly color hue, lighting, etc. This is done to support their use within machine learning classifiers, and in some cases to simplify the learning tasks.
• Data augmentation. Although not an activity mentioned in dataset papers, this is a standard practice in industry; it re-uses task-chunks of data processing to modify the image-related products (especially the size of the set).
• Data splitting. The dataset is divided into training(, validation) and test sets, which creates new versions of all the dataset products.
• Product refinement. Although not considered in any computer vision paper, there is an additional set of chunks related to dataset transformations, generally happening after deployment. Datasets might be reduced in size, augmented with new labels and images, or deprived of some, as ethical or performance issues are identified or distribution shifts happen. This impacts the properties of each dataset product. Surfacing these chunks might provide insights to develop best practices or critical views on the handling of harms.
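To illustrate how these process-level chunks compose into pipelines, and how the divergences discussed in Section 3.3 could be surfaced mechanically, here is a toy Python comparison of two chunk orderings of the kind observed in our review. The orderings and the helper function are hypothetical, for illustration only.

```python
from typing import Dict, List

# Two plausible development pipelines: the data-processing chunk can
# occur at different points (cf. Section 3.3), which is rarely reported.
PIPELINE_A: List[str] = ["label-definition", "data-collection",
                         "data-processing", "data-annotation",
                         "data-filtering", "data-splitting"]
PIPELINE_B: List[str] = ["label-definition", "data-collection",
                         "data-annotation", "data-filtering",
                         "data-processing", "data-augmentation",
                         "data-splitting"]


def diff_pipelines(a: List[str], b: List[str]) -> Dict[str, List[str]]:
    """Surface differences in chunk presence and chunk order between
    two pipelines, e.g. a development and a deployment pipeline."""
    return {
        "only_in_a": [c for c in a if c not in b],
        "only_in_b": [c for c in b if c not in a],
        "order_in_a": [c for c in a if c in b],
        "order_in_b": [c for c in b if c in a],
    }


if __name__ == "__main__":
    print(diff_pipelines(PIPELINE_A, PIPELINE_B))
    # only_in_b reports 'data-augmentation'; the order_in_* entries show
    # that the shared chunks appear in different sequences.
```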
