arXiv:1909.08859v1 [cs.CL] 19 Sep 2019
Procedural Reasoning Networks for Understanding Multimodal Procedures

Mustafa Sercan Amac, Semih Yagcioglu, Aykut Erdem, Erkut Erdem
Hacettepe University Computer Vision Lab
Dept. of Computer Engineering, Hacettepe University, Ankara, TURKEY
fb21626915,n13242994,aykut,[email protected]

Abstract

This paper addresses the problem of comprehending procedural commonsense knowledge. This is a challenging task as it requires identifying key entities, keeping track of their state changes, and understanding temporal and causal relations. Contrary to most of the previous work, in this study, we do not rely on strong inductive bias and explore the question of how multimodality can be exploited to provide a complementary semantic signal. Towards this end, we introduce a new entity-aware neural comprehension model augmented with external relational memory units. Our model learns to dynamically update entity states in relation to each other while reading the text instructions. Our experimental analysis on the visual reasoning tasks in the recently proposed RecipeQA dataset reveals that our approach improves the accuracy of the previously reported models by a large margin. Moreover, we find that our model learns effective dynamic representations of entities even though we do not use any supervision at the level of entity states.¹

¹ The project website with code and demo is available at https://hucvl.github.io/prn/

1 Introduction

A great deal of commonsense knowledge about the world we live in is procedural in nature and involves steps that show ways to achieve specific goals. Understanding and reasoning about procedural texts (e.g. cooking recipes, how-to guides, scientific processes) is very hard for machines, as it demands modeling the intrinsic dynamics of the procedures (Bosselut et al., 2018; Dalvi et al., 2018; Yagcioglu et al., 2018). That is, one must be aware of the entities present in the text, infer relations among them and even anticipate changes in the states of the entities after each action. For example, consider the cheeseburger recipe presented in Fig. 1. The instruction “salt and pepper each patty and cook for 2 to 3 minutes on the first side” in Step 5 entails mixing three basic ingredients, the ground beef, salt and pepper, together and then applying heat to the mix, which in turn causes chemical changes that alter both the appearance and the taste. From a natural language understanding perspective, the main difficulty arises when a model sees the word patty again at a later stage of the recipe. It still corresponds to the same entity, but its form is totally different.

Over the past few years, many new datasets and approaches have been proposed that address this inherently hard problem (Bosselut et al., 2018; Dalvi et al., 2018; Tandon et al., 2018; Du et al., 2019). To mitigate the aforementioned challenges, the existing works rely mostly on heavy supervision and focus on predicting the individual state changes of entities at each step. Although these models can accurately learn to make local predictions, they may lack global consistency (Tandon et al., 2018; Du et al., 2019), not to mention that building such annotated corpora is very labor-intensive. In this work, we take a different direction and explore the problem from a multimodal standpoint. Our basic motivation, as illustrated in Fig. 1, is that accompanying images provide complementary cues about causal effects and state changes. For instance, it is quite easy to distinguish raw meat from cooked meat in the visual domain.

In particular, we take advantage of the recently proposed RecipeQA dataset (Yagcioglu et al., 2018), a dataset for multimodal comprehension of cooking recipes, and ask whether it is possible to have a model which employs dynamic representations of entities in answering questions that require multimodal understanding of procedures. To this end, inspired by Santoro et al. (2018), we propose Procedural Reasoning Networks (PRN), a model that incorporates entities into the comprehension process and makes it possible to keep track of entities, understand their interactions and accordingly update their states across time. We report that our proposed approach significantly improves upon previously published results on visual reasoning tasks in RecipeQA, which test the understanding of causal and temporal relations from images and text. We further show that the dynamic entity representations can capture semantics of the state information in the corresponding steps.

[Figure 1: eight recipe steps, each shown with an accompanying image in which the ingredients are marked by labeled bounding boxes. The step texts are reproduced below.]

Step 1: Ingredients and Tools. 1 hamburger bun, 4 oz. ground beef (25-30% fat if available) (2 ounce per patty), salt and pepper, Thousand Island dressing (or In-N-Out official spread), 1 large tomato, 1 large lettuce leaf, 1 whole onion, 2 slices real American cheese

Step 2: Form Patties. Begin by preheating a cast iron skillet over medium heat. Make four patties by rolling 2-ounce portions of beef into balls and weigh it out on the kitchen scale. In-N-Out uses a 25-30% fat beef patty which is not easily available at a local grocery store, in many cases it would have to be ground by hand. Forming them slightly larger than buns. I do this by placing the 2 ounce beef in between 2 pieces of parchment paper then taking my large cast iron skillet and applying a little force to smash the beef into a patty. You will want to form them into a perfect circle with your hand if they do not come out right after the initial smash.

Step 3: Season. Salt and pepper one side of the patty now, the other half will be done when grilling.

Step 4: Toast Buns. Lightly toast the both halves of the hamburger bun, face down in the pan. Set aside.

Step 5: Cook. Set the patty seasoned side down on the skillet, salt and pepper each patty and cook for 2 to 3 minutes on the first side. Flip the patties over and season with salt and pepper and immediately place one slice of cheese on each one. Cook for 2-3 minutes on the other side.

Step 6: Chop Onions & Tomatoes. For the "authentic" feel you want to get a large onion and a large tomato, then slice a large slice from the middle to use on the hamburger.

Step 7: Chop Onions & Tomatoes. Assemble the burger in the following stacking order from the bottom up: bottom bun, thousand island dressing, tomato, lettuce, beef patty with cheese, onion slice, beef patty with cheese, top bun

Step 8: Enjoy. All that's left to do is enjoy this copycat double double! To be honest, this was impressively close to the real taste. I would definitely make this one again.

Figure 1: A recipe for preparing a cheeseburger (adapted from the cooking instructions available at https://www.instructables.com/id/In-N-Out-Double-Double-Cheeseburger-Copycat). Each basic ingredient (entity) is highlighted by a different color in the text and with bounding boxes on the accompanying images. Over the course of the recipe instructions, ingredients interact with each other and change their states with each cooking action (underlined in the text), which in turn alters the visual and physical properties of entities. For instance, the tomato changes its form by being sliced up and then stacked on a hamburger bun.

2 Visual Reasoning in RecipeQA

In our study, we particularly focus on the visual reasoning tasks of RecipeQA, namely visual cloze, visual coherence, and visual ordering tasks, each of which examines a different reasoning skill². We briefly describe these tasks below.

Visual Cloze. In the visual cloze task, the question is formed by a sequence of four images from consecutive steps of a recipe where one of them is replaced by a placeholder. A model should select the correct one from a multiple-choice list of four answer candidates to fill in the missing piece. In that regard, the task inherently requires aligning visual and textual information and understanding temporal relationships between the cooking actions and the entities.

Visual Coherence. The visual coherence task tests the ability to identify the image within a sequence of four images that is inconsistent with the text instructions of a cooking recipe. To succeed in this task, a model should have a clear understanding of the procedure described in the recipe and at the same time connect language and vision.

Visual Ordering. The visual ordering task is about grasping the temporal flow of visual events with the help of the given recipe text. The questions show a set of four images from the recipe and the task is to sort jumbled images into the correct order. Here, a model needs to infer the temporal relations between the images and align them with the recipe steps.

3 Procedural Reasoning Networks

In the following, we explain our Procedural Reasoning Networks model. Its architecture is based on a bi-directional attention flow (BiDAF) model (Gardner et al., 2018)³, but is also equipped with an explicit reasoning module that acts on entity-specific rela-

² We intentionally leave the textual cloze task out from our experiments as the questions in this task do not necessarily need multimodality.

³ Our implementation is based on the implementation pub-