Move2Hear: Active Audio-Visual Source Separation
In Proceedings of IEEE International Conference on Computer Vision (ICCV), 2021
arXiv:2105.07142v2 [cs.CV] 26 Aug 2021

Sagnik Majumder1  Ziad Al-Halah1  Kristen Grauman1,2
1The University of Texas at Austin  2Facebook AI Research
{sagnik, ziad, grauman}@cs.utexas.edu

Abstract

We introduce the active audio-visual source separation problem, where an agent must move intelligently in order to better isolate the sounds coming from an object of interest in its environment. The agent hears multiple audio sources simultaneously (e.g., a person speaking down the hall in a noisy household) and it must use its eyes and ears to automatically separate out the sounds originating from a target object within a limited time budget. Towards this goal, we introduce a reinforcement learning approach that trains movement policies controlling the agent's camera and microphone placement over time, guided by the improvement in predicted audio separation quality. We demonstrate our approach in scenarios motivated by both augmented reality (the system is already co-located with the target object) and mobile robotics (the agent begins arbitrarily far from the target object). Using state-of-the-art realistic audio-visual simulations in 3D environments, we demonstrate our model's ability to find minimal movement sequences with maximal payoff for audio source separation. Project: http://vision.cs.utexas.edu/projects/move2hear.

[Figure 1 graphic. Legend: RGB; Mixed Binaural (L, R); Si Sound Source; Si Target Audio; Target Monaural; Agent's Start Location; Navigation Point; sources S1, S2, S3.]
Figure 1: Active audio-visual source separation. Given multiple mixed audio sources Si in a 3D environment, the agent is tasked to separate a target source (shown in green) by intelligently moving around using cues from its egocentric audio-visual input to improve the quality of the predicted target audio signal. See text.

1. Introduction

Audio-visual events play an important role in our daily lives. However, in real-world scenarios, physical factors can either restrict or facilitate our ability to perceive them. For example, a father working upstairs might move to the top of the staircase to better hear what his child is calling out to him from below; a traveler in a busy airport may shift closer to the gate agent to catch the flight delay announcements amidst the din, without moving too far in order to keep her suitcase in sight; a friend across the table in a noisy restaurant may tilt her head to hear the dinner conversation more clearly, or scooch her chair to better catch music from the band onstage.

Such examples show how controlled sensor movements can be critical for audio-visual understanding. In terms of audio sensing, a person's nearness and orientation relative to a sound source affects the clarity with which it is heard, especially when there are other competing sounds in the environment. In terms of visual sensing, one must see obstacles to circumvent them, spot desired and distracting sound sources, use visual context to hypothesize an out-of-view sound source's location, and actively look for "sweet spots" in the visible 3D scene that may permit better listening.

In this work, we explore how autonomous multi-modal systems might learn to exhibit such intelligent behaviors. In particular, we introduce the task of active audio-visual source separation: given a stream of egocentric audio-visual observations, an agent must decide how to move in order to recover the sounds being emitted by some target object, and it must do so within bounded time. See Figure 1. Unlike traditional audio-visual source separation, where the goal is to isolate sounds in passive, pre-recorded video [26, 30, 1, 23, 53, 32, 19, 33], the proposed task calls for actively controlling the camera and microphones' positions over time. Unlike recent embodied AI work on audio-visual navigation, where the goal is to travel to the location of a sounding object [16, 28, 15, 17], the proposed task calls for returning a separated soundtrack for an object of interest in limited time, without necessarily traveling all the way to it.
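To make the episode structure concrete, the following sketch casts the task as a bounded observation/action loop: at each step the agent re-estimates the target's audio from its egocentric RGB and binaural inputs, then chooses a motion. It is a minimal illustration under assumed interfaces (the `env`, `policy`, and `separator` objects and the action names are hypothetical), not the paper's implementation.

```python
# Minimal sketch of one active audio-visual separation episode.
# The environment/policy/separator interfaces and the discrete action
# names below are illustrative assumptions.

ACTIONS = ("MOVE_FORWARD", "TURN_LEFT", "TURN_RIGHT", "STOP")  # assumed motion set

def run_episode(env, policy, separator, budget=20):
    """Move for at most `budget` steps, then return the final estimate of the
    target object's separated (monaural) audio."""
    obs = env.reset()  # e.g., {"rgb": HxWx3 image, "binaural": 2xT waveform, "target": category}
    estimate = None
    for _ in range(budget):
        # Re-estimate the target's audio from the current egocentric inputs.
        estimate = separator(obs["rgb"], obs["binaural"], obs["target"])
        action = policy(obs, estimate)  # pick the next rotation/translation (or STOP)
        if action == "STOP":
            break
        obs = env.step(action)
    return estimate  # the separated soundtrack for the object of interest
```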
We consider two variants of the new task. In the first, the system begins exactly at the location of the desired sounding object and must fine-tune its positioning to hear better; this variant is motivated by augmented reality (AR) applications where the object of interest is known and visible (e.g., the person seated across from someone wearing an assistive audio-visual AR device) yet local movements of the device sensors are still beneficial to improve the audio separation. In the second variant, the system begins at an arbitrary position away from the object of interest; this variant is motivated by mobile robotics applications, where an agent detects a sound from afar (e.g., the child calling from downstairs) but it is entangled with distractors and requires larger movements within the environment to hear correctly. We refer to these scenarios as near-target and far-target, respectively.

Towards addressing the active audio-visual source separation problem, we propose Move2Hear, a reinforcement learning (RL) framework where an agent learns a policy for how to move to hear better. The agent receives an egocentric audio-visual observation sequence (RGB and binaural audio) along with the target category of interest, and at each time step decides its next motion (a rotation or translation of the camera and microphones). During training, as it aggregates these observations over time with an explicit memory module and recurrent network, the agent is rewarded for improving its estimate of the target object's latent (monaural) sound. In particular, the reward promotes movements that better resolve the target sound from the other distractor sounds in the environment. Our approach handles both near- and far-target scenarios.
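One concrete way to realize a reward that promotes movements that better resolve the target sound, as described above, is to reward the per-step drop in the error between the predicted and ground-truth target monaural spectrograms. The sketch below is a hedged illustration of that dense-reward pattern; the L1 error on magnitude spectrograms and all names are assumptions, not the paper's exact reward definition.

```python
import numpy as np

def separation_error(pred_mono_spec, gt_mono_spec):
    """Mean L1 distance between predicted and ground-truth target monaural
    magnitude spectrograms (one plausible separation-quality measure)."""
    return float(np.abs(pred_mono_spec - gt_mono_spec).mean())

def step_reward(prev_error, curr_error):
    """Dense reward: positive when the latest move + prediction reduced the
    separation error, negative when it made the estimate worse."""
    return prev_error - curr_error

# Toy example: the estimate improves after a move, so the step earns a positive reward.
gt = np.zeros((256, 64))
prev = separation_error(np.full((256, 64), 0.4), gt)  # error before moving
curr = separation_error(np.full((256, 64), 0.1), gt)  # error after moving
assert step_reward(prev, curr) > 0
```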
Importantly, optimal positioning for audio source separation is not the same as navigation to the source, both because the agent faces a time budget—it may be impossible to reach the target in that time—and also because the geometry of the 3D environment and relative positions of distractor sounds make certain positions relative to the target more amenable to separation. For example, in Fig. 1 the agent is tasked with separating the audio from object S3. Here, going directly to S3 is not ideal, as that position will have high interference from the other audio sources in the scene, S1 and S2. By moving around the kitchen bar, the agent manages to dampen the signal from S1 significantly due to the intermediate obstacles (walls) and, simultaneously, emphasize the signal from S3 compared to S2, thus leading to better separation.

We test our approach with realistic audio-visual SoundSpaces [16] simulations on the Habitat platform [64] comprising 47 real-world Matterport3D environment scans [11], along with an array of sounds from diverse human speakers, music, and other background distractors. Our model successfully learns how to move to hear its target more clearly in unseen scenes, surpassing baselines that systematically survey their surroundings, smartly or randomly explore, or navigate directly to the source. We explore the synergy of both vision and audio for solving this task.

… sound separation and visual navigation motion policies; and 3) we thoroughly experiment with a variety of sounds, visual environments, and use cases. While just the first step in this area, we believe our work lays groundwork to explore new problems for multi-modal agents that move to hear better.

2. Related Work

Passive Audio(-Visual) Source Separation. Passive (non-embodied) separation of audio sources using solely audio inputs has been extensively studied in signal processing. While sometimes only single-channel monaural audio is assumed [69, 71, 82, 40], multi-channel audio captured with multiple microphones [49, 93, 22]—including binaural audio [20, 83, 89]—facilitates separation by making the spatial cues explicit. Using vision together with sound improves separation. Audio-visual (AV) separation methods leverage mutual information [39, 24], subspace analysis [70, 56], matrix factorization [54, 66, 30], correlated onsets [7, 43], and deep learning to separate speech [1, 26, 23, 53, 2, 19, 33], music [32, 27, 87, 90], and other objects [30]. While some methods extract the audio tracks for all classes present in the mixture [67, 79], others isolate one specific target [52, 80, 34, 35, 94].

Whereas prior work assumes a pre-recorded video as input, our work addresses a new embodied perception version of the audio separation task, in which an agent can see, hear, and move in a 3D environment to actively hear a source better. To our knowledge, our work is the first to consider how intelligent movement influences a multi-modal mobile agent's ability to separate sound sources. In addition, whereas existing video methods use dynamic object motion to tease out audio-visual associations (especially for speech [26, 1, 23, 53, 19, 33]), our setting demands using visual cues in the surrounding 3D environment to move to "sweet spots" for listening to a source amidst competing background sounds. Finally, our task requires recovering the target object's latent monaural audio as output—stripping away the effects of the agent's relative position, environment geometry, and scene materials. This aspect of the task is by definition absent for AV separation in passive video.

Visual and Audio-Visual Navigation. While mobile robots traditionally navigate by a mixture of explicit mapping and planning [78, 25], recent work explores learning navigation policies from egocentric image observations (e.g., [36, 63, 45]). Facilitated by fast rendering platforms [64] and realistic 3D visual assets [11, 86, 72], re-
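To make concrete the distinction drawn above between what the agent hears (a pose- and room-dependent binaural mixture) and what it must output (the target's clean monaural audio), here is a sketch of the convolutive mixing model that SoundSpaces-style simulators use to render received audio: each latent monaural source is convolved with a binaural room impulse response for the agent's current pose and the results are summed per ear. The function name and array layout are illustrative assumptions.

```python
import numpy as np

def render_binaural(sources, brirs):
    """Render what the agent hears at its current pose: each latent monaural
    source waveform is convolved with a pose-dependent binaural room impulse
    response (left/right ear) and the results are summed per ear."""
    length = max(len(s) + max(len(rir_l), len(rir_r)) - 1
                 for s, (rir_l, rir_r) in zip(sources, brirs))
    mix = np.zeros((2, length))
    for s, (rir_l, rir_r) in zip(sources, brirs):
        left, right = np.convolve(s, rir_l), np.convolve(s, rir_r)
        mix[0, : len(left)] += left
        mix[1, : len(right)] += right
    return mix  # 2 x T binaural mixture: the agent's audio observation

# Toy usage: two sources, each with its own (left, right) impulse response.
s1, s2 = np.random.randn(1000), np.random.randn(800)
rirs = [(np.array([1.0, 0.5]), np.array([0.8, 0.4])),
        (np.array([0.6]), np.array([0.9, 0.3, 0.1]))]
binaural_obs = render_binaural([s1, s2], rirs)  # shape (2, 1001)

# The agent only observes the mixture; the task's target output is the clean
# monaural waveform of one chosen source, with these room/pose effects removed.
```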