Neural Naturalist: Generating Fine-Grained Image Comparisons

Maxwell Forbes, Christine Kaeser-Chen, Piyush Sharma, Serge Belongie
University of Washington; Google Research; Cornell University and Cornell Tech
[email protected] fchristinech,[email protected] [email protected]
https://mbforbes.github.io/neural-naturalist/
Work done during an internship at Google.

arXiv:1909.04101v3 [cs.CL] 14 Nov 2019

Abstract

We introduce the new Birds-to-Words dataset of 41k sentences describing fine-grained differences between photographs of birds. The language collected is highly detailed, while remaining understandable to the everyday observer (e.g., “heart-shaped face,” “squat body”). Paragraph-length descriptions naturally adapt to varying levels of taxonomic and visual distance (drawn from a novel stratified sampling approach) with the appropriate level of detail. We propose a new model called Neural Naturalist that uses a joint image encoding and comparative module to generate comparative language, and evaluate the results with humans who must use the descriptions to distinguish real images.

Our results indicate promising potential for neural models to explain differences in visual embedding space using natural language, as well as a concrete path for machine learning to aid citizen scientists in their effort to preserve biodiversity.

[Figure 1 panels. Legend: descriptive phrase, body part.
TOP (perceptual difficulty: high; highly detailed), Animal 1 vs Animal 2: “Animal 2 looks smaller and has a stouter, darker bill than Animal 1. Animal 2 has black spots on its wings. Animal 2 has a black hood that extends down onto its breast, and the rest of its breast is white with orange only on its sides. In comparison, Animal 1’s breast is entirely orange.”
BOTTOM (perceptual difficulty: medium; fewer details), Animal 1 vs Animal 2: “Animal 2 is brightly red-colored all over, except for a black oval around its beak. Animal 1 has more muted red and grey colors.”]

Figure 1: The Birds-to-Words dataset: comparative descriptions adapt naturally to the appropriate level of detail (orange underlines). A difficult distinction (TOP) is given a longer and more fine-grained comparison than an easier one (BOTTOM). Annotators organically use everyday language to refer to parts (green highlights).

1 Introduction

Humans are adept at making fine-grained comparisons, but sometimes require aid in distinguishing visually similar classes. Take, for example, a citizen science effort like iNaturalist[1], where everyday people photograph wildlife, and the community reaches a consensus on the taxonomic label for each instance. Many species are visually similar (e.g., Figure 1, top), making them difficult for a casual observer to label correctly. This puts an undue strain on lieutenants of the citizen science community to curate and justify labels for a large number of instances. While everyone may be capable of making such distinctions visually, non-experts require training to know what to look for.

[1] https://www.inaturalist.org

Field guides exist for the purpose of helping people learn how to distinguish between species. Unfortunately, field guides are costly to create because writing such a guide requires expert knowledge of class-level distinctions.

In this paper, we study the problem of explaining the differences between two images using natural language. We introduce a new dataset called Birds-to-Words of paragraph-length descriptions of the differences between pairs of bird photographs. We find several benefits from eliciting comparisons: (a) without a guide, annotators naturally break down the subject of the image (e.g., a bird) into pieces understood by the everyday observer (e.g., head, wings, legs); (b) by sampling comparisons from varying visual and taxonomic distances, the language exhibits naturally adaptive granularity of detail based on the distinctions required (e.g., “red body” vs “tiny stripe above its eye”); (c) in contrast to requiring comparisons between categories (e.g., comparing one species vs. another), non-experts can provide high-quality annotations without needing domain expertise.

We also propose the Neural Naturalist model architecture for generating comparisons given two images as input. After embedding images into a latent space with a CNN, the model combines the two image representations with a joint encoding and comparative module before passing them to a Transformer decoder. We find that introducing a comparative module (an additional Transformer encoder) over the combined latent image representations yields better generations.

[Figure 2 diagram labels: pivot; strata visual, species, genus, …, order (increasing visual and taxonomic distance); sampled subtree; sampling cut off at taxonomic CLASS (too distant).]

Figure 2: Illustration of pivot-branch stratified sampling algorithm used to construct the Birds-to-Words dataset. The algorithm harnesses visual and taxonomic distances (increasing vertically) to create a challenging task with broad coverage.

Our results suggest that these classes of neural models can assist in fine-grained visual domains when humans require aid to distinguish closely related instances. Non-experts, such as amateur naturalists trying to tell apart two species, stand to benefit from comparative explanations. Our work approaches this sweet spot of visual expertise, where any two in-domain images can be compared, and the language is detailed, adaptive to the types of differences observed, and still understandable by laypeople.

Recent work has made impressive progress on context-sensitive image captioning. One direction of work uses class labels as context, with the objective of generating captions that distinguish why the image belongs to one class over others (Hendricks et al., 2016; Vedantam et al., 2017). Another choice is to use a second image as context, and generate a caption that distinguishes one image from another. Previous work has studied ways to generalize single-image captions into comparative language (Vedantam et al., 2017), as well as comparing two images with high pixel overlap (e.g., surveillance footage) (Jhamtani and Berg-Kirkpatrick, 2018). Our work complements these efforts by studying directly comparative, everyday language on image pairs with no pixel overlap.

Our approach outlines a new way for models to aid humans in making visual distinctions. The Neural Naturalist model requires two instances as input; these could be, for example, a query image and an image from a candidate class. By differentiating between these two inputs, a model may help point out subtle distinctions (e.g., one animal has spots on its side), or features that indicate a good match (e.g., only a slight difference in size). These explanations can aid in understanding both differences between species and variance within instances of a single species.

2 Birds-to-Words Dataset

Our goal is to collect a dataset of tuples (i1, i2, t), where i1 and i2 are images, and t is a natural language comparison between the two. Given a domain D, this collection depends critically on the criteria we use to select image pairs.

If we sample image pairs uniformly at random, we will end up with comparisons encompassing a broad range of phenomena. For example, two images that are quite different will yield categorical comparisons (“One is a bird, one is a mushroom.”). Alternatively, if the two images are very similar, such as two angles of the same creature, comparisons between them will focus on highly detailed nuances, such as variations in pose. These phenomena support rich lines of research, such as object classification (Deng et al., 2009) and pose estimation (Murphy-Chutorian and Trivedi, 2009).

Dataset                       Domain        Lang  Images Ctx  Images Cap
CUB Captions (R, 2016)        Birds         M     1           1
  Example: “An all black bird with a very long rectrices and relatively dull bill.”
CUB-Justify (V, 2017)         Birds         S     7           1
  Example: “The bird has white orbital feathers, a black crown, and yellow tertials.”
Spot-the-Diff (J&B, 2018)     Surveillance  E     2           1–2
  Example: “Silver car is gone. Person in a white t shirt appears. 3rd person in the group is gone.”
Birds-to-Words (this work)    Birds         E     2           2
  Example: “Animal1 is gray, while animal2 is white. Animal2 has a long, yellow beak, while animal1’s beak is shorter and gray. Animal2 appears to be larger than animal1.”

Table 1: Comparison with recent fine-grained language-and-vision datasets. Lang values: S = scientific, E = everyday, M = mixed. Images Ctx = number of images shown, Images Cap = number of images described in caption. Dataset citations: R = Reed et al., V = Vedantam et al., J&B = Jhamtani and Berg-Kirkpatrick.

We aim to land somewhere in the middle. We wish to consider sets of distinguishable but intimately related pairs. This sweet spot of visual similarity is akin to the genre of differences studied in fine-grained visual classification (Wah et al., 2011; Krause et al., 2013a). We approach this collection with a two-phase data sampling procedure. We first select pivot images by sampling from our full domain uniformly at random. We then branch from these images into a set of secondary images that emphasizes fine-grained comparisons, but yields broad coverage over the set of sensible relations. Figure 2 provides an illustration of our sampling procedure.

[Figure 3, TOP: plot of annotation lengths for the compared datasets, with Birds-to-Words labeled.]

Birds-to-Words Dataset
  Image pairs              3,347
  Paragraphs / pair        4.8
  Paragraphs               16,067
  Tokens / paragraph       32.1 (mean)
  Sentences                40,969
  Sentences / paragraph    2.6 (mean)
  Clarity rating           ≥ 4/5
  Train / dev / test       80% / 10% / 10%

Figure 3: Annotation lengths for compared datasets (TOP), and statistics for the proposed Birds-to-Words dataset (BOTTOM). The Birds-to-Words dataset has a large mass of long descriptions in comparison to related datasets.

2.1 Domain

We sample images from iNaturalist, a citizen sci-
ison pointing out the differences in animal type.
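As a rough illustration of the two-phase procedure described in Section 2 (sample pivots uniformly, then branch into secondary images at increasing taxonomic distance), the sketch below draws one secondary image per taxonomic stratum for each pivot. It is a minimal sketch under stated assumptions, not the authors' implementation: the `Image` type, the `RANKS` tuple (including the `family` rank), the function names, and the one-sample-per-stratum rule are all hypothetical, and the visual-distance stratum shown in Figure 2 is omitted because the text here does not specify how visual distance is computed.

```python
import random
from dataclasses import dataclass

# Assumed ranks, most specific first. Anything that shares no label at these
# ranks is treated as too distant (the cutoff at taxonomic class in Figure 2).
RANKS = ("species", "genus", "family", "order")


@dataclass(frozen=True)
class Image:
    """Hypothetical record: an image id plus one taxon label per rank in RANKS."""
    image_id: str
    taxonomy: tuple


def shared_rank(a, b):
    """Return the most specific rank at which a and b share a label, else None."""
    for index, rank in enumerate(RANKS):
        if a.taxonomy[index] == b.taxonomy[index]:
            return rank
    return None


def pivot_branch_pairs(images, num_pivots, rng):
    """Phase 1: pick pivots uniformly at random.
    Phase 2: for each pivot, pick one secondary image from every non-empty
    stratum (same species, same genus, ..., same order)."""
    pairs = []
    for pivot in rng.sample(images, num_pivots):
        strata = {rank: [] for rank in RANKS}
        for other in images:
            if other.image_id == pivot.image_id:
                continue
            rank = shared_rank(pivot, other)
            if rank is not None:  # None means beyond the cutoff: skip
                strata[rank].append(other)
        for rank in RANKS:
            if strata[rank]:
                pairs.append((pivot, rng.choice(strata[rank]), rank))
    return pairs


if __name__ == "__main__":
    # Toy domain for illustration only.
    toy = [
        Image("a", ("Cardinalis cardinalis", "Cardinalis", "Cardinalidae", "Passeriformes")),
        Image("b", ("Cardinalis cardinalis", "Cardinalis", "Cardinalidae", "Passeriformes")),
        Image("c", ("Cardinalis sinuatus", "Cardinalis", "Cardinalidae", "Passeriformes")),
        Image("d", ("Corvus corax", "Corvus", "Corvidae", "Passeriformes")),
        Image("e", ("Bubo bubo", "Bubo", "Strigidae", "Strigiformes")),
    ]
    for pivot, secondary, rank in pivot_branch_pairs(toy, 2, random.Random(0)):
        print(pivot.image_id, secondary.image_id, rank)
```

Sampling per stratum rather than uniformly over all pairs is what keeps closely related pairs (same species or genus) from being swamped by the far more numerous distant ones; a fuller version would presumably add a stratum of visually nearest neighbours in some embedding space.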