
Counting Moving People by Staring at a Blank Wall

by Prafull Sharma

B.S. in Computer Science, Stanford University (2017)

Submitted to the Department of Electrical Engineering and Computer Science in partial fulfillment of the requirements for the degree of Master of Science at the Massachusetts Institute of Technology, June 2019.

© Massachusetts Institute of Technology 2019. All rights reserved.

Author: Department of Electrical Engineering and Computer Science, May 23, 2019

Certified by: William T. Freeman, Thomas and Gerd Perkins Professor of Electrical Engineering and Computer Science, Thesis Supervisor

Certified by: Frédo Durand, Professor of Electrical Engineering and Computer Science, Thesis Supervisor

Accepted by: Leslie A. Kolodziejski, Professor of Electrical Engineering and Computer Science, Chair, Department Committee on Graduate Students

Counting Moving People by Staring at a Blank Wall

by Prafull Sharma

Submitted to the Department of Electrical Engineering and Computer Science on May 24, 2019, in partial fulfillment of the requirements for the degree of Master of Science

Abstract

We present a passive non-line-of-sight imaging method that seeks to count hidden moving people from the observation of a uniform receiver such as a blank wall. The technique amplifies imperceptible changes in the indirect illumination recorded in videos to reveal a signal that is strongly correlated with the activity taking place in the hidden part of a scene. We use this signal to predict, from a video of a blank wall, whether moving persons are present and to estimate their number. To this end, we train a neural network using data collected under a variety of viewing scenarios. We find good overall accuracy in predicting whether the room is occupied by zero, one, or two persons, and analyze the generalization and robustness of our method with both real and synthetic data.

Thesis Supervisor: William T. Freeman
Title: Thomas and Gerd Perkins Professor of Electrical Engineering and Computer Science

Thesis Supervisor: Frédo Durand
Title: Professor of Electrical Engineering and Computer Science

Acknowledgments

I would like to first thank my advisors, Prof. William T. Freeman and Prof. Frédo Durand. They have provided constant guidance throughout my thesis. Thanks to them for proposing the idea presented in this thesis and for supporting me through the numerous failed attempts at solving this problem. I am grateful for having them by my side.

Thanks to Prof. Gregory W. Wornell, Prof. Antonio Torralba, Prof. Yoav Y. Schechner, Prof. Jeffrey H. Shapiro, Prof. Vivek Goyal, and Dr. Franco N. C. Wong for helpful discussions and questions which enabled me to think about the problem from different perspectives.

I have been fortunate to work with Dr. Miika Aittala on this project. Numerous insights during our discussions have not only shaped the trajectory of this thesis, but have also positively influenced my way of approaching research.

I am grateful to be a part of the vision and graphics community at CSAIL. It has been a fun learning experience to interact with members of the group. I would like to especially thank (in no particular order) Dr. Katie Bouman, Dr. Michaël Gharbi, Dr. Zoya Bylinskii, Dr. Tzu-Mao Li, Dr.
Ronnachai Jaroensri, Luke Anderson, Lukas Murmann, Amy Zhao, Yuanming Hu, Camille Biscarrat, Caroline Chan, Spandan Madan, Dr. Jun-Yan Zhu, Dr. Guha Balakrishnan, Jiajun Wu, Xiuming Zhang, Zhoutong Zhang, Vickie Ye, Adam Yedidia, and David Bau, among many others. Thanks to the CSAIL administrators, the staff at TIG, and the CSAIL staff for making CSAIL an amazing space for conducting research.

Thanks to Harshvardhan, Puneet Jain, and Abhijit Bhadra for contributing to the data collection for this project and for supporting me otherwise. I would also like to thank all my friends at MIT and outside for injecting the required dose of fun into my life.

I would like to express my gratitude towards my grandmother, mother, aunt, and my mentor, Sraddhalu Ranade, for their unconditional love and support. This work would not have been possible without any of you!

Contents

1 Introduction
2 Related Work
  2.1 Passive methods
  2.2 Active methods
3 Overview
4 Method
  4.1 Signal Extraction
  4.2 Space-Time Plots for Classification
  4.3 Convolutional Neural Network Classifier
5 Results
  5.1 Data Collection
  5.2 Human Accuracy
  5.3 Classification Results
  5.4 Stress Test Scenes
  5.5 Analysis with Synthetic Data
6 Conclusion

List of Figures

1-1 (a) Our imaging setup is a camera pointed at a blank wall in a scene where people are possibly moving outside the observed frame. (b) The recorded video typically appears completely still to the naked eye. (c) We subtract the static components of the video and amplify the weak remainder signal, resulting in a video of moving soft shadows and reflections. (d) The video is then processed into a space-time plot that shows the large-scale movements, and (e) input into a neural network classifier that estimates the number of moving persons in the hidden scene.

3-1 Example setup of a possible scenario where the camera is recording a blank wall while people move in the room outside the line of sight of the camera.

3-2 (a) Indirect illumination blocked by the person casts a soft shadow on the wall. (b) Light reflected by the person tints the wall with the color of the clothing.

4-1 Top left: a representative frame of the seemingly static input video. Top right: a frame of the amplified residual video after subtracting the mean frame reveals blurry colored shadows and reflections. Bottom: a sequence of frames shows the motion of these features.

4-2 Examples of space-time plots for the zero-, one-, and two-people cases. As one of the spatial dimensions has been collapsed, the space and time dimensions can now be viewed as 2D images with time advancing towards the right.

4-3 An example of observed space-time plots for the one- and two-person cases, and the corresponding space-time plots generated from synchronized ground-truth videos of the hidden scenes. The plots cover a duration of 2 minutes.

4-4 Convolutional neural network architecture used for classifying between 0, 1, and 2 persons.
5-1 Accuracy of the model across scenes.

5-2 Observation image of the setup for the stress tests.

5-3 Two-dimensional flatland setup for synthetic data generation. The two blockers representing persons move back and forth along random directions. A 1D image is rendered at the observation plane, taking into account the mutual visibility between the blockers and the back wall acting as an illuminant.

5-4 (a) Samples of synthetic space-time plots for the one-person scenario. (b) Samples of synthetic space-time plots for two people, in the same scenario as (a).

5-5 Classification performance for synthetic two-person video segments as a function of different relative motion parameters.

List of Tables

5.1 Average model performance on the 3 classes.

Chapter 1

Introduction

Non-line-of-sight (NLoS) imaging seeks to extract information about regions of a scene that are not directly visible for observation, for example due to occlusion by walls. This has important applications ranging from emergency response to elderly monitoring to the early detection of pedestrians for smart vehicles [1, 2]. Active methods rely on indirectly probing the hidden scene with, e.g., pulsed lasers, time-of-flight sensors, or WiFi signals [3, 4, 5, 6, 7]. In contrast, passive approaches only use cameras. Existing passive methods have typically exploited occluders that act as accidental imaging devices for tasks such as looking around corners, inferring light fields, and computational periscopy [8, 9, 10].

In this thesis, we take an occluder-agnostic view of passive NLoS imaging and demonstrate the recovery of meaningful information, namely the number of people moving in a hidden room, from a video of a diffusely reflecting wall. A representative scenario is shown in Figure 3-1: the people are walking around in the hidden room, and a camera records the wall. We are interested in the case where the people cast no direct shadows; under this setting, the wall might appear entirely static to the naked eye. We show that amplifying subtle temporal changes in indirect illumination reveals a rich signal about the dynamics of the subjects in the hidden scene, and that this signal can be used to determine the number of people present. To achieve this, we train a convolutional neural network to classify between zero, one, or two persons moving in the space, based on a short video input.

Figure 1-1: (a) Our imaging setup is a camera pointed at a blank wall in a scene where people are possibly moving outside the observed frame. (b) The recorded video typically appears completely still to the naked eye. (c) We subtract the static components of the video and amplify the weak remainder signal, resulting in a video of moving soft shadows and reflections. (d) The video is then processed into a space-time plot that shows the large-scale movements, and (e) input into a neural network classifier that estimates the number of moving persons in the hidden scene.
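The first half of this pipeline is easy to state concretely. The following is a minimal NumPy sketch of the processing described above and in the Figure 1-1 caption: subtract the static (temporal-mean) component of the video, amplify the weak residual, and collapse one spatial dimension to form a space-time plot. The function names, the gain value, and the choice to average over image rows are illustrative assumptions; the actual procedure is detailed in Chapter 4.

```python
import numpy as np

def motion_residual(frames):
    """Remove the static scene component by subtracting the temporal mean.

    frames: float array of shape (T, H, W, 3) with values in [0, 1].
    Returns the weak dynamic residual (same shape, roughly zero-mean).
    """
    return frames - frames.mean(axis=0, keepdims=True)

def amplify(residual, gain=50.0):
    """Scale the residual around mid-gray for visualization.

    The gain value is an illustrative assumption, not a value from the thesis.
    """
    return np.clip(0.5 + gain * residual, 0.0, 1.0)

def space_time_plot(residual):
    """Collapse one spatial dimension so the whole video becomes a 2D image.

    Averaging over image rows turns each frame into a single column; stacking
    the columns gives a (space x time) plot with time advancing to the right.
    """
    columns = residual.mean(axis=1)    # (T, W, 3): one column per frame
    return columns.transpose(1, 0, 2)  # (W, T, 3)

# Example on placeholder data: 30 seconds of 240x320 video at 30 fps.
video = np.random.rand(900, 240, 320, 3).astype(np.float32)
plot = space_time_plot(motion_residual(video))
```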
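For the classification step, a sketch can only be generic: the layer sizes and depths below are assumptions for illustration, not the architecture depicted in Figure 4-4, which specifies the network actually used. The sketch shows the overall shape of the task, a small convolutional network mapping a 3-channel space-time plot to logits over the three occupancy classes.

```python
import torch
import torch.nn as nn

class PersonCountCNN(nn.Module):
    """Illustrative 3-way classifier over space-time plots (0, 1, or 2 persons)."""

    def __init__(self, num_classes=3):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(1),  # global pooling tolerates variable plot sizes
        )
        self.classifier = nn.Linear(32, num_classes)

    def forward(self, x):
        # x: (batch, 3, space, time) space-time plots
        h = self.features(x).flatten(1)
        return self.classifier(h)  # logits over {0, 1, 2} moving persons

# Example forward pass on a batch of placeholder plots.
model = PersonCountCNN()
logits = model(torch.randn(4, 3, 320, 256))  # -> shape (4, 3)
```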