ICCV DAILY

Interview with: Antonio Torralba
Presenting Work by: Katie L. Bouman, John McCormac
Editorial with: Rita Cucchiara
Today's Picks by: Dèlia Fernandez
Workshop: The Joint Video and Language...

Dèlia's Picks for today, Wednesday 25

Dèlia Fernandez is an industrial PhD candidate at the startup company Vilynx and in the Image and Video Processing Group (GPI) at the Polytechnic University of Catalonia (BarcelonaTech). "My research topic aims at introducing knowledge bases into the automatic understanding of video content. ICCV is the perfect place to learn about the latest trends and techniques for image and video analysis, get in touch with the community and hear about new, interesting topics and challenges." Dèlia presented a paper at the Workshop on Web-Scale Vision and Social Media: "VITS: Video Tagging System from Massive Web Multimedia Collections". Find it here!

Dèlia's picks of the day:

• Morning
P3-34: Temporal Dynamic Graph LSTM for Action-Driven Video Object Detection
P3-42: Hierarchical Multimodal LSTM for Dense Visual-Semantic Embedding
P3-50: Unsupervised Learning of Important Objects From First-Person Videos
P3-52: Visual Relationship Detection With Internal and External Linguistic Knowledge Distillation
P3-72: Common Action Discovery and Localization in Unconstrained Videos

• Afternoon
P4-69: Learning View-Invariant Features for Person Identification in Temporally Synchronized Videos Taken by Wearable Cameras
P4-74: Chained Multi-Stream Networks Exploiting Pose, Motion, and Appearance for Action Classification and Detection

"Venice is a magic place where you can feel Italian traditions alive! I was delighted to find out ICCV was taking place here and I am enjoying the chance to have nice Italian food and explore the city."

Summary

02 Dèlia's Picks
04 Editorial
05 Opening Talk
06 Antonio Torralba
10 Workshop: The Joint Video and…
12 John McCormac
14 Katie L. Bouman
17 Venezia Lido
18 News

Subscribe for free to Computer Vision News

ICCV Daily
Publisher: RSIP Vision
Copyright: RSIP Vision
All rights reserved. Unauthorized reproduction is strictly forbidden. Our editorial choices are fully independent from ICCV and its organizers.

Editorial

Good morning ICCV! In this record year for ICCV, what will be the new trends and approaches? We can figure it out by looking at some statistics on submitted papers, collected during the long period of review, rebuttal and review. We presented them at yesterday's Opening Talk, but the short time available did not allow us to cover everything.

The most popular macro-categories of topics are consistently the same: the large majority of papers fall into "Recognition, detection and matching", with 497 submissions (of which 156 were accepted for the conference), while "3D computer vision", "image processing", "motion and tracking", "video events and activities" and "learning methods" remain the building blocks of our conference days. The sub-categories that stand out for both a high acceptance rate and a high number of submissions are four. Two are more theoretical: Discrete Optimization (26 papers accepted out of 56, a 46.4% acceptance rate) and 3D Modeling and Reconstruction (25 accepted out of 60, a 41.7% acceptance rate). Two others are application-oriented: one is, obviously, autonomous driving (10 of the 19 submitted papers were accepted, a 52.6% acceptance rate); video and language is also a hot topic, since 7 of 13 proposals were accepted (a 54% acceptance rate).

ICCV is a traditional conference in terms of titles, with some room for fantasy: we built a histogram of the most common title words, and more than 300 paper titles contained the word "learning" together with "deep", "network" or "convolutional". Among the 20 most-used words we find "tracking", "re-identification", "person", "face" and "action", long-time staples of our titles, while the new entry in the list is "adversarial". GANs are becoming very popular indeed, if you think of the roughly 1,000 people who attended Ian Goodfellow's tutorial on GANs a couple of days ago. Finally, some words that were so common in past titles are disappearing: in the set of 50 words used only once or twice, we found "objectness", "abnormal", "Bayesian" and "PCA", which characterized our research in the past. Sic transit gloria mundi. The world is changing, and the words too.

We let you discover more topics and keywords in this second ICCV Daily. Whether you are here in Venice or anywhere else, stay connected with the community by reading the magazine of ICCV. Enjoy the reading!

Rita Cucchiara
Professor, Università di Modena e Reggio Emilia
Program Chair, ICCV 2017

Ralph Anzarouth
Editor, Computer Vision News
Marketing Manager, RSIP Vision

Opening Talk

General Chair Marcello Pelillo during the Opening Talk of ICCV 2017. He was kind enough to dedicate one of his slides to the ICCV Daily, the new publication born of the partnership between ICCV and Computer Vision News, the magazine of the community published by RSIP Vision.

Antonio Torralba

Originally from Spain, Antonio Torralba is a Professor at MIT in the Computer Science and Artificial Intelligence Laboratory.

"We are missing the proximity with the students when you have such a big class"

Photo by Lillie Paquette / MIT School of Engineering

Antonio, how long have you been at MIT?
I've been there since 2000.

That means that you have been at MIT for seventeen years! How has MIT changed since that time?
Particularly with fields that are related to artificial intelligence, the changes are constant. We are revising the curriculum every year. Whenever you teach something, in the next year half of the slides are obsolete. One of the things that has changed a lot is the number of students per class. I started teaching in 2007 or 2008. I started teaching a computer vision class there with Bill Freeman. There were like 30 students, and I think half of them just thought the class was about something different. It was just very, very few people. Now we have 300 people in the class, and they all know what we are doing! [both laugh] … and there are many people that couldn't get in! It's a totally different field.

Can you tell us about academia in Spain versus in the States?
It depends also on the fields, but in computer science in particular, one of the main issues in Spain is that the professors at the university have to do a lot of teaching. Research is a discipline that requires a lot of time. Otherwise, ideas don't come. You don't have time to play and take risks. Because if you have little time, you need to go to something that is for sure, and you need to publish. In Spain, you have to publish a number of papers in order to get promotions. This is very different in the US. In the US, you have a lot of time. You have to teach maybe 30 hours per semester. This is very little, in fact. Even if we think that teaching takes a lot of time for us, it really doesn't take that much time. You have a lot of time for research, which means that you can take risks. That's what makes research great. I think that's what is missing in Spain sometimes in the computer science field. Other fields are different, but in computer science, I think that they are too constrained by the amount of hours they need to teach, but they have great people.

You are teaching relatively fewer hours in America, while having 300 students. How can you get to know your students?
That's true. There is this problem that we are missing the proximity with the students when you have such a big class. That's something that is lost, but there is a benefit actually. When you have so many students, there are people that are really, really motivated about the field.

They come to talk to you. In the end, you encounter them in some way or another. For instance, at MIT a few students were explaining to me all of the things that they do. They were first-year undergrad students, and I thought, "Wow, this group is really amazing! I want you all to come to my lab and work with me." They did come! I'm having a blast with them. These are undergrads with no experience, but they are so motivated. This is one of the benefits of having a big class.

Can you tell us about a "Wow" moment that you had with your students, a moment when you were so impressed by one of your students?
A lot of us professors share the same anecdote. Generally, I think the thing that impresses us the most with the students is when you have a student, you give them something to do, a problem that needs to be solved like this. The student later on comes back and says, "You know, what you proposed does not work, but here is how to make it work." That's what I love. There are some students that might just say, "Oh, this does not work." Well, probably I already knew it was not going to work, but the student that comes back to you and finds the mistakes in what you were saying, finds the solution, another way of looking at the problem, or maybe even changes the problem so that it has a solution, that's when you say, "Wow!" These things happen.

Does it happen with both male and female students?
Yea, yea, of course. The thing is that there are not many female students. But for instance, I was telling you about the students that invited me for dinner, who I told to come to my lab. It was three students, two females and one male. From those students, only the two females could come, because the male already had some other lab where he was working.

Working on context-based object recognition in 2003

What about ICCV? How has ICCV changed over the years?
ICCV, just like the computer vision class, is getting larger and larger. It started out very small. You could almost know everyone at the conference. I remember the first ones. Well, the first one wasn't very familiar to me because I was coming from a very different setting. I was in a lab in France. I only knew the big professors from their names and papers. For me to see them, it was like seeing a Hollywood star for the first time on the street. It was very shocking. The first conferences were like that. Suddenly, you could see the human nature behind all of these big names who are doing all of these great things. It was still very small so you got to know them, and you got to know the students. It was really interesting.

Nowadays, at least for me, it's very different because the conference is so large that it's very hard to know a lot of people. One of the things that I used to do when I attended this conference at the beginning is that I would go out with people to socialize. They were not people from MIT. I generally would hang out with people from Berkeley, CMU, all kinds of universities. It was a very diverse group of researchers. There was almost no one from MIT so I couldn't be with my own group. I think that when the conference gets so large, one of the issues is that you get intimidated by so many people. In the end, you cluster with the people that you already know. I think that's a pity.

"You really learn a lot when you have a small manual, and there is a lot of information missing"

"It is still hard to tell if I am playing the guitar or banging a saucepan against the wall…"

How has teaching changed in your view?
Well, we use a lot more PowerPoint than before.

And less chalk.
Yes, a lot less chalk. That's an important thing. I don't know. It has benefits and disadvantages. I think it's really standardized, especially in computer science, in the classes that would be relevant to ICCV like computer vision and image processing. They are very much PowerPoint and slides based. It has the benefit that you can cover a lot of ground. The students will have the slides, and they can just look at them. But there is something interesting about seeing a professor write on the board, going at the speed of thought, and making mistakes. You learn a lot from the mistakes. You learn where things are confusing. You gain confidence when you say, "Oh, that step is difficult. I know. I mean, I saw the professor making a mistake here." Maybe even there are bugs. Sometimes there are errors in the equation. You learn a lot about debugging. I think with the slides, it's too clean. It's too perfect. Sometimes you need to see what you are learning.

You are not challenged to find the mistakes.
That's right, that's right. It's like learning a language when you learn to program. You really learn a lot when you have a small manual, and there is a lot of information missing. You have to discover it by yourself, and there are mistakes. You have all of these bugs. You are learning a lot from that. That has changed a lot in education, I think.

On the other hand, now it's much easier to find information to complement whatever you are studying. In the past, you also had libraries with all of these books. Of course, people normally didn't go there, but you had all of that knowledge there. It just wasn't as easily available. I think that probably another thing that has changed a lot is the fact that now, you can not only read a paper or a book, you can actually go to the source. You can find the webpage of the person who actually created that algorithm. Suddenly, you can put a face to the name and see the human nature behind the work. You can send an email, and sometimes, they reply! I think that's very important because it breaks up some sort of illusion that something is too difficult to get. It makes things look more doable.

What would you like to tell students to encourage them?
One of the things when you are a student is that it's very easy to get overwhelmed. You see all of the papers. Many times you have an idea, and you see that somebody has already done it because everything goes so fast now. I think that it is very important not to get discouraged by your papers getting rejected and not to feel discouraged by seeing someone who has already done something that you wanted to do. I think that it is an important experience to go to these conferences to see what the dialogue is, what people are thinking about, or the problems that people are thinking about. I think that it is also very important to try to exploit whatever makes you different, whatever makes you original. You might have a set of skills that makes you very distinct from other people. Try to leverage that. Find what it is, try to exploit it, and find a direction that you will feel motivated about. Whenever you see that someone has done something in that direction, don't feel discouraged. Feel that you are gaining some knowledge out of it. How can you bring your own originality and build your own signature? What is going to make you special? That is something that you have to educate.

Some of the past and present members of Antonio's group, during a dinner at the last CVPR in Hawaii. Sitting in the middle, Phillip Isola.

Workshop

The Joint Video and Language Understanding Workshop: MovieQA and The Large Scale Movie Description Challenge (LSMDC)

The workshop was organized by Anna Rohrbach, Makarand Tapaswi, Atousa Torabi, Tegan Maharaj, Marcus Rohrbach, Sanja Fidler, Christopher Pal and Bernt Schiele.

The workshop and challenge series started in 2015 at ICCV, and was also presented at ECCV 2016 jointly with the Storytelling workshop. This year, for the first time, the workshop included the MovieQA challenge. The workshop's goal is to bring together researchers on video description and understanding, as well as to present and discuss the results of the LSMDC and MovieQA challenges.

The workshop kicked off with an exciting talk by Antonio Torralba of MIT (read his full interview in this magazine), exceeding room capacity with people sitting on the floor. Antonio motivated self-supervised multi-modal learning as the natural way in which babies and toddlers learn about the world. He presented their approach to joint audio-visual understanding for a novel task of audio image captioning and discussed generation of audio from video. In order to perform such multi-modal tasks, Antonio argued that neural networks need to discover objects in the scene in an unsupervised manner.

Next, the LSMDC 2017 results were announced for all four challenge tracks. The winner of the Movie Description track is the team from Ohio State University and the Air Force Research Lab (Oliver Nina, Scott Clouse, Alper Yilmaz). Their approach relies on augmenting LSMDC with multiple captions and exploiting multiple decoders. Impressively, the team from Seoul National University (YoungJae Yu, Jongseok Kim, Gunhee Kim) won the three remaining challenge tracks (Movie Retrieval, Movie Multiple-Choice Test, and Movie Fill-In-the-Blank), improving by at least 8% over last year's winners in all three challenges. See the workshop's webpage for more details.

Shortly after, Kate Saenko of Boston University gave an insightful presentation about the interpretability of neural networks. In image and video captioning, one expects that the networks are "looking" at the relevant visual regions. However, networks are often led by incorrect biases in the training data (e.g. woman and kitchen) and misled by language priors. Kate proposed methods which provide insights into the internal reasoning of neural networks.

During the coffee break, everyone had a chance to discuss 12 poster presentations as well as the challenge submissions in more detail.

Even with all the chairs from the adjacent room being added, people were still standing inside and outside the room for Andrew Zisserman's (University of Oxford, DeepMind) talk about the long-standing problem of character labeling in TV series and movies. Andrew presented their step-by-step improvements, starting with domain adaptation of face classifiers from actor photographs to work with movies, and using voice classification to determine an active speaker. Nevertheless, to achieve 100% accuracy, Andrew expects that even better face descriptors and additional story-specific character context may be necessary.

The MovieQA challenge results on movie-clip (video + subtitles) based answering were presented next. The winning team from Tianjin University (Bo Wang, Youjiang Xu, Ziwei Yang, Yahong Han) represented video frames and clips as a combination of words and subtitles respectively. The team from Seoul National University and SK Telecom Video Lab (Seil Na, Sangho Lee, Gunhee Kim) came a close second with an approach that used pooling of aligned video frames and dialogs.

In the last invited talk, Ivan Laptev (INRIA) presented the complexities of defining an action: granularity (too specific vs. too vague) and ambiguity (context influences the way one can define the action). Ivan proposed a definition of actions as a sequence of steps and presented their work on instructional ("how to") videos, which automatically discovers this sequence in an unsupervised manner. Ivan also discussed recent work where actions are defined as the cause of a change in object state.

In summary, the workshop covered several key aspects of joint video and language understanding, as well as the recent innovations that allowed the challenge winners to achieve impressive results.

Workshop audience during Andrew Zisserman's presentation on revisiting film character recognition in TV series and movies.

John McCormac

SceneNet RGB-D: Can 5M Synthetic Images Beat Generic ImageNet Pre-Training on Indoor Segmentation?

John McCormac is a PhD student at the Dyson Robotics Laboratory, Imperial College London. He spoke to us ahead of his poster today, which is co-authored with Ankur Handa, Stefan Leutenegger and Andrew J. Davison.

"To produce a large dataset for indoor segmentation tasks"

The work is called SceneNet RGB-D. The idea is to produce a large dataset for indoor segmentation tasks. The real problem with collecting large-scale data is that annotations are tricky to come by. It is expensive to manually annotate, and whenever you do, you can end up with lots of different errors. Synthetic data is all there, and people have done a lot of work in figuring out how to go from a latent space to a photorealistic image. What they are trying to do in this work is use that to provide ground truth for supervised training.

John tells us his main research focus is on semantically annotating 3D scenes. When he started to do that, he realised that there are very few datasets that provide large video sequences for semantic annotation. The one that he used was the NYUv2 dataset, with about 1,500 annotated images. With this work, he was going for a much larger scale. This dataset gives pixel-perfect annotations, but at a much bigger scale: more like 5 million images.

When the team started working on it, they went for an approach of just having a random trajectory through a scene. It was a billiard ball moving through a 3D scene with random rotations and translations. What they found is that it does not produce a good trajectory for indoor segmentation, because humans follow a very particular model of moving around. John explains one particular problem they encountered: "It would pan across a wall, so for large portions of a trajectory you are just staring at nothing but a wall, which is a terrible viewpoint and you never really get it when a human is using the camera. For that, we came up eventually with a two-body trajectory system where you have one body that is an automated point of focus that pretends to be the thing the human is looking at, then the other point is the camera that is the viewpoint. The nice thing about that is that whenever you are looking at the second body, it tends to be in the middle of the room. It is very unlikely that you will get the body between you and the wall, so it is just like a pretend moving point. After we did that, we started to get much better renderings. Besides that, there was a lot of engineering work about just getting the scenes to look right."
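To make the two-body idea concrete, here is a minimal sketch of how such a trajectory could be generated, assuming smoothed random walks for both bodies. The function names, room extent and smoothing choices are illustrative assumptions, not the authors' actual pipeline.

    import numpy as np

    def smooth_random_walk(n_steps, lo, hi, step_scale, seed):
        """A random 3-D path, low-pass filtered so the motion is smooth rather than jittery."""
        rng = np.random.default_rng(seed)
        path = np.cumsum(rng.normal(scale=step_scale, size=(n_steps, 3)), axis=0)
        kernel = np.ones(15) / 15.0                      # simple moving-average smoothing
        path = np.stack([np.convolve(path[:, i], kernel, mode="same") for i in range(3)], axis=1)
        return np.clip(path + (lo + hi) / 2, lo, hi)     # keep the body inside the room

    def look_at_rotations(camera, focus):
        """Rotation matrices that keep the camera pointed at the moving focus body."""
        forward = focus - camera
        forward /= np.linalg.norm(forward, axis=1, keepdims=True)
        up = np.tile([0.0, 1.0, 0.0], (len(forward), 1))
        right = np.cross(up, forward)
        right /= np.linalg.norm(right, axis=1, keepdims=True)
        true_up = np.cross(forward, right)
        return np.stack([right, true_up, forward], axis=2)   # (N, 3, 3)

    lo, hi = np.array([0.5, 0.5, 0.5]), np.array([4.5, 2.0, 4.5])     # assumed room extent in metres
    camera = smooth_random_walk(500, lo, hi, step_scale=0.02, seed=0)
    focus = smooth_random_walk(500, lo, hi, step_scale=0.05, seed=1)  # the "pretend" point of focus
    poses = look_at_rotations(camera, focus)

Because the focus body wanders well inside the room, the camera rarely ends up staring straight into a wall, which is exactly the failure mode John describes for the billiard-ball trajectories.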

John goes on to say that another problem they had was that they were trying to randomise everything to get large scale, so they were just dropping objects that they had randomly positioned and sampled in the room. What happens then is that every time an object is dropped it randomly falls over, so you end up with no chairs that are upright. To solve that, they moved the centre of mass to be slightly below the chairs, so that the physics simulation would naturally right some of them, and you get an upright chair as well.

The one core theme is randomisation. They were trying to get randomisation to produce a big variety of instructive training examples. They didn't want to produce lots of the same images, because they wanted to produce a big dataset. Randomly sampling the models, randomly producing the trajectories, and then randomly texturing the layouts were all part of that core theme, so they did not have to manually annotate anything and could simply train. (A rough sketch of this sampling recipe follows below.)

The randomisation is a baseline approach. They wanted to do it because it is simple, but John says there are lots of ways they could improve upon it. He explains that at the moment there are no scene grammars, so you could do a sort of simulation by which you make sure that if there is a table there might be a chair nearby.

In closing, John says one of the big questions they had was: just how far can you go with synthetic data? They were benchmarking against the standard VGG ImageNet weights to see whether, if they train from scratch on synthetic data that has never seen a real image before, versus the normal VGG weights that have been trained on ImageNet, the benefit of having the correct, perfect ground-truth pixel labelling, instead of classification, outweighs the fact that it's synthetic and so not perfectly photorealistic. They evaluate this question in their paper.

If you want to learn more, you can come to the team's poster today (Wednesday) at 11:00 at Salone Adriatico.
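As referenced above, here is a minimal sketch of the "randomise everything" recipe John describes: random models, random placements, random textures, and a fresh trajectory seed per scene. All names (layouts, model files, config fields) are hypothetical illustrations rather than the actual SceneNet RGB-D code.

    import random
    from dataclasses import dataclass, field

    @dataclass
    class SceneConfig:
        layout: str                                     # which room shell to render
        objects: list = field(default_factory=list)     # (model, position, yaw) to drop into the room
        textures: dict = field(default_factory=dict)    # surface name -> texture file
        trajectory_seed: int = 0                        # seed for the two-body camera trajectory

    def sample_scene(layouts, models, textures, rng):
        cfg = SceneConfig(layout=rng.choice(layouts), trajectory_seed=rng.randrange(10**6))
        for _ in range(rng.randint(5, 20)):
            # Random model at a random pose; a physics engine later drops it into place.
            cfg.objects.append((rng.choice(models),
                                (rng.uniform(0.0, 4.0), rng.uniform(0.0, 4.0), rng.uniform(0.5, 2.0)),
                                rng.uniform(0.0, 360.0)))
        for surface in ("floor", "walls", "ceiling"):
            cfg.textures[surface] = rng.choice(textures)
        return cfg

    rng = random.Random(42)
    scenes = [sample_scene(["bedroom_01", "office_03"],
                           ["chair.obj", "table.obj", "lamp.obj"],
                           ["wood.png", "plaster.png", "carpet.png"], rng)
              for _ in range(3)]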

A selection of images from the synthetic dataset along with the various ground truth available. Ground-truth camera poses for each trajectory are also available.

Katie Bouman

Turning Corners Into Cameras: Principles and Methods

Katie Bouman is a postdoc, having just graduated from her PhD at MIT, where she worked in the Computer Science and Artificial Intelligence Laboratory.

Katie is presenting a poster today, which is co-authored with Vickie Ye, Adam B. Yedidia, Frédo Durand, Gregory W. Wornell, Antonio Torralba (see full interview at page 6), and William T. Freeman. Project website (with code) is here. An overview video is here on YouTube.

The work asks the question: can we see around corners? We can't naturally see what is behind a wall, but maybe the information about what is behind the wall is scattered in the visible scene that we can see. Katie says that if you look at the edge of the wall, the base of the corner, there is a little shadow there. That shadow is not exactly a shadow of the wall, but a faint gradient. It is caused by the light on the opposite side of the wall: the hidden scene that you cannot see. By looking at the structure of that shadow at the base of the corner, you can recover what is going on behind the corner, like where people are and how they are moving.

Katie says that aside from how cool it is to be able to see around corners (it's like a superpower!), there are many important applications, such as in automotive collision avoidance systems. For example, an intelligent car that warns you before an accident occurs: if a little kid darts out into the street, you might not see them behind the corner, but by analysing this shadow structure at the base of the corner you can have a few seconds of notice.

Another application could be as part of search and rescue, where you don't want to go into a dangerous building – if it is unstable or there is a hostage situation – but you want to see where people are and how they are moving behind the corner without physically going behind it.

Katie explains: "It is an incredibly small signal that we are trying to pick up. Let's say you are standing at a wall and there is a corner straight ahead of you. If you have your shoulder up against the wall, you can't see anything behind that corner. As you slowly move in a circle about that corner, moving away from the wall, you slowly see more and more of the hidden scene. Similarly, the light reflected from the ground at each of those points is summing up all the light from the hidden scene, more and more as you go around the corner. Basically, as you move in a circle around the corner, the light reflected is an integral over the hidden scene, which is basically just summing up different fractions of it. By simply taking a derivative around that circle around the corner, you can try to recover people moving and objects in the hidden scene."

They noticed, both through theory and experiment, that a person only affects about 0.1% of the intensity of the light that is reflected. It is incredibly small: less than one intensity value of the pixels that you get on a standard camera. However, as they are incorporating information across the frame and you have multiple circles around that corner, you can average that information together and still extract this tiny signal from each image.

Katie says they tried it in many situations and found that it was fairly reliable – indoor and outdoor; all different kinds of surfaces – brick, linoleum, concrete; varying weather conditions. She remembers that one time it started raining in the middle of their experiment and they were still able to see the track behind it: "I was incredibly surprised. At first, I thought we would not get anything, because these huge raindrops started to appear on the ground and they are way more visible than this tiny imperceptible signal of the people moving. You can't see that, but you see these huge raindrops changing the colour of the light. The reason we were able to recover the people is because we were doing this averaging over the frame and although the raindrops cause artefacts, you can still see the trajectory of the person behind it."

We construct a 1-D video of an obscured scene using RGB video taken with a consumer camera. The stylized diagram in (a) shows a typical scenario: two people—one wearing red and the other blue—are hidden from the camera's view by a wall. Only the region shaded in yellow is visible to the camera. To an observer walking around the occluding edge (along the magenta arrow), light from different parts of the hidden scene becomes visible at different angles (see sequence (b)). Ultimately, this scene information is captured in the intensity and color of light reflected from the corresponding patch of ground near the corner. Although these subtle irradiance variations are invisible to the naked eye (c), they can be extracted and interpreted from a camera position from which the entire obscured scene is hidden from view. Image (d) visualizes these subtle variations in the highlighted corner region. We use temporal frames of these radiance variations on the ground to construct a 1-D video of motion evolution in the hidden scene. Specifically, (e) shows the trajectories over time that specify the angular position of hidden red and blue subjects illuminated by a diffuse light.

They found that it was a very simple, robust, computationally inexpensive algorithm that can be run in real time on standard RGB videos that you take with your own webcam. You do need some favourable conditions, such as light in your scene: it can't see behind corners in a dark room. Other methods might do better in something like this, like a time-of-flight kind of camera. This is a completely passive approach; it is not shining any light into the scene and watching bounces. It is not sensitive to ambient light; all it requires is that you have light in the scene and that your objects are moving.

Katie tells us they pinpointed the idea that all around us there are naturally-occurring cameras. Even though they are not made with lenses and sensors in the typical way you think of a camera, they still encode the pictures of things that can't be seen in an image. There is this ubiquitous corner in almost every picture that you take, and it encodes information. All they do is identify where the corner is. In this work, they do rectify it so that they have camera calibration properties, so it looks like you're looking straight down at the corner. After that, it is basically just a derivative, in which they also add a Gaussian smoothing prior to reduce the effect of noise and make it a little clearer. For every frame of a video, they recover a one-dimensional image of the hidden scene. By stacking up those reconstructions, they can see the tracks of people moving.

Katie concludes by saying: "We have shown here that there is this rich signal there, and that we can try to leverage it and learn about hidden properties of the scene around us. It is not exactly ready to be used in real-world applications. Right now, we require it to be working on a static camera. We are looking at these tiny deviations, so we need the camera to be still in order for us to pick out those deviations. One of the student authors on the paper, Vickie Ye, is working on a moving camera, that we can then attach to a self-driving car kind of system. She has been fairly successful in the beginning stages of getting it working when you have a moving camera, and still being able to recover those properties. She has also been great at getting it working in a live demo scenario so that it is completely in real time. You can just download the code from our web site, run it off your webcam and see around corners. She wrote this code, we have this live demo, you can download it, or come see it at our poster."

If you want to learn more, come to Katie's poster today (Wednesday) at 11:00 at Sala Laguna.
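For readers who want a feel for the pipeline Katie describes (rectify the ground patch, average along arcs about the corner, differentiate over angle, smooth), here is a rough NumPy sketch under simplifying assumptions. Function and parameter names are illustrative; the authors' released code remains the reference implementation.

    import numpy as np
    from scipy.ndimage import gaussian_filter1d

    def one_d_frame(patch, corner_xy, n_angles=90, r_min=5, r_max=60):
        """Recover a 1-D angular image of the hidden scene from one rectified ground patch.

        patch     : 2-D float array, background-subtracted intensity of the ground near the corner
        corner_xy : (x, y) pixel location of the corner, the centre of the arcs
        """
        ys, xs = np.indices(patch.shape)
        radius = np.hypot(xs - corner_xy[0], ys - corner_xy[1])
        angle = np.arctan2(ys - corner_xy[1], xs - corner_xy[0])

        # Average the tiny intensity variations over many radii per angular bin;
        # this averaging is what lets the ~0.1% signal rise above sensor noise.
        bins = np.linspace(0.0, np.pi / 2, n_angles + 1)   # assume the visible wedge spans 90 degrees
        ring = (radius > r_min) & (radius < r_max)
        sums, _ = np.histogram(angle[ring], bins=bins, weights=patch[ring])
        counts, _ = np.histogram(angle[ring], bins=bins)
        profile = sums / np.maximum(counts, 1)

        # The light at each angle integrates the hidden scene up to that angle, so the
        # angular derivative (smoothed against noise) approximates the scene itself.
        return gaussian_filter1d(np.gradient(profile), sigma=2)

    # Stacking one_d_frame() over all frames of a video yields a 2-D (time x angle) image
    # in which people moving behind the corner show up as tracks.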

Venezia Lido

One side of ICCV 2017

The other side of ICCV 2017

Improve your vision with Computer Vision News, the only magazine covering all the fields of the computer vision and image processing industry. Subscribe for free (click here).



A publication by RSIP Vision