Real-Time Target and Pose Recognition for 3-D Graphical Overlay

by

Jeffrey M. Levine

Submitted to the Department of Electrical Engineering and Computer Science in partial fulfillment of the requirements for the degrees of Bachelor of Science in Computer Science and Engineering and Master of Engineering in Electrical Engineering and Computer Science at the

MASSACHUSETTS INSTITUTE OF TECHNOLOGY

June 1997

© Massachusetts Institute of Technology, MCMXCVII. All rights reserved.

The author hereby grants to MIT permission to reproduce and distribute publicly paper and electronic copies of this thesis document in whole or in part, and to grant others the right to do so.

Author: Department of Electrical Engineering and Computer Science, May 23, 1997

Certified by: Alex P. Pentland, Toshiba Professor of Media Arts and Sciences, Thesis Supervisor

Accepted by: Arthur Smith, Chairman, Departmental Committee on Graduate Theses

Real-Time Target and Pose Recognition for 3-D Graphical Overlay
by Jeffrey M. Levine

Submitted to the Department of Electrical Engineering and Computer Science on May 23, 1997, in partial fulfillment of the requirements for the degrees of Bachelor of Science in Computer Science and Engineering and Master of Engineering in Electrical Engineering and Computer Science

Abstract

Applications involving human-computer interaction are becoming increasingly popular and necessary as the power of computers is integrated into almost every corner of business and entertainment. One problem with current interfaces, however, is that the user is forced to manually input information about their current task, or to glance away from what they are doing to read a screen.
One solution to this problem of efficiency is to allow the computer to automatically sense the environment around the user, and display data in the user's field of view when appropriate. With computer graphics drawn over the user's field of view, the appearance of certain objects in the scene can be modified, or augmented, to either point out critical regions or provide on-line information about an object whenever the object is encountered. Two solutions to this problem are proposed: one requires each special object to be marked with special color "tags", while the other tries to detect the presence of the object directly, with no artificial additions to the appearance of the object. In both systems, the 3D pose of the object is tracked from frame to frame so that the overlay will appear to be solidly attached to the real object. The overlaid graphics eliminate the distraction of consulting a separate screen, and provide the information exactly when it is most useful, thus creating an efficient and seamless interface.

Thesis Supervisor: Alex P. Pentland
Title: Toshiba Professor of Media Arts and Sciences

Acknowledgments

I'd like to thank the many people who made this thesis possible. First, I'd like to thank Sandy Pentland for his direction and advice, without which this whole project would not have been possible. Also, Thad Starner deserves a great deal of credit for introducing me to AR and helping me in countless ways, and in countless all-nighters, over the past couple of years. I'd also like to thank Steve Mann, who introduced me to his "Video Orbits" algorithm, and spent many an all-nighter as well getting portions of the system to work. Ali Azarbayejani also deserves credit for his much-needed advice when I was just beginning to learn about 3D pose estimation. Finally, on the technical side, I'd like to thank Bernt Schiele for visiting the Media Lab and answering my many questions on his object recognition software.
Of course, beyond technical help there are many others who enabled me to complete this document. Thanks to Wasiuddin Wahid, my officemate, for not letting me go home at night when I was tired, for making me take breaks when I was frustrated, and for listening to me yell at him all the time when I was angry. Gayla Cole deserves a thanks beyond words for helping me keep my sanity throughout everything that has happened in the past months, for getting me to work at the god-awful time of 9am, and for harping on me to get work done and then corrupting me with late-night viewings of "Days of our Lives". Also, without the harsh words of Scott Rixner I might've been even more slack than I am now. I also would like to thank Jason Wilson for his ever-so-true advice on thesis writing, getting jobs, women, and lots of other stuff that I can't write here. Finally, Karen Navarro deserves thanks for listening to my awful sense of humor, and for showing me the truth behind the mystery of the Yeti.

I'd like to thank everyone in Vismod as well. All of you have been extremely supportive, and I'm glad to have worked in a group of such capable and friendly people. Special thanks to the person who builds the recognition system that can finally count how many beverages I consume at each X-hour.

Aside from my thesis, there are a few more people that I haven't mentioned above, but who have helped me tremendously through my MIT career. I don't know where I'd be without all of you, so a special thanks goes to Robin Devereux, Doug DeCouto, Anil Gehi, and Peter Yao. And last, but definitely not least, thanks to my parents, for getting me to this point in my life, and for always being there when I needed them most.

Contents

1 Introduction
  1.1 AR in a Marked Environment
  1.2 AR in a Natural Environment
    1.2.1 Object recognition using Multidimensional Receptive Field Histograms
    1.2.2 Video Orbits
  1.3 Overlay
  1.4 The Need for Real-Time
  1.5 The Apparatus
  1.6 Applications

2 Target Detection
  2.1 Detection in a Marked Environment
  2.2 Detection in an Unmarked Environment
    2.2.1 Multidimensional Receptive Field Histograms
    2.2.2 Gaussian derivatives
    2.2.3 Normalization
    2.2.4 Histogram Comparison
    2.2.5 Probabilistic Object Recognition
    2.2.6 Putting it all together

3 Pose estimation and tracking
  3.1 Pose estimation with corresponding features
    3.1.1 How to Cheat
  3.2 Video Orbits
    3.2.1 Optical Flow
    3.2.2 Projective Flow
    3.2.3 Parameter Estimation
    3.2.4 Relating approximate model to exact model
    3.2.5 Estimation Algorithm

4 Performance
  4.1 Speed
  4.2 Recognition Rates
  4.3 Tracking Accuracy

5 Future work
  5.1 Algorithm Improvements
  5.2 Hardware limitations
  5.3 Scalability

6 Conclusion

List of Figures

1-1 Alan Alda wearing the prototype AR apparatus. Snapshot taken during the filming of Scientific American Frontiers at the MIT Media Lab.
1-2 A printer with an older version of the tags that use LEDs.
1-3 The same printer with an overlay showing how to remove the toner cartridge.
2-1 The viewpoint of an anonymous Augmented Reality user looking at the author (also wearing an AR rig) surrounded by multiple 1D AR tags.
2-2 A small movie is overlaid in the upper right corner of this large-screen television, displaying one of the demonstrations (in this case, a sign language recognition system) that could be run on that station.
3-1 Method of computation of the eight parameters p between two images from the same pyramid level, g and h. The approximate model parameters q are related to the exact model parameters p in a feedback system. From Mann.
3-2 The shopping list is overlaid over the cashier. Notice that even as the cashier moves out of the scene, the tracking can continue to function using cues from the background.

List of Tables

4.1 Average and Maximum Algorithm Speeds in Frames per Second
4.2 Histogram Comparison (χ²) with 100 3D rotation changes
4.3 Probabilistic Object Recognition

Chapter 1

Introduction

Applications involving human-computer interaction are becoming increasingly popular and necessary as the power of computers is integrated into almost every corner of business and entertainment. One problem with current interfaces, however, is that the user is forced to manually input information about their current task, or to glance away from what they are doing to read a screen. Yet computers no longer have to be restricted to displaying only on conventional monitors. One can imagine wearing a pair of glasses that not only lets light pass through from the outside world, but also displays the output from the computer right over the user's field of vision [4]. A window containing either text or graphics could appear to hang in space in front of your eyes, allowing you to access the data without taking your eyes away from what you are doing.
This is useful when walking down a crowded street, where staring down at a computer screen could be dangerous, or during a conversation, where constant distractions to consult a computer might be considered rude or annoying. Now imagine that these glasses have a miniature camera built in, so that the computer can see everything the user can see [11]. With this addition, the computer can not only display data that is directly in your line of sight, but it can also display data directly related to the environment around you. This combination allows the computer to see what the user sees and correspondingly modify the appearance of the real world, in a system referred to as Augmented Reality, or AR for short.
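For an overlay to appear attached to a real object, the graphics must be drawn at the image location where the object projects, given the object's estimated 3D pose. The snippet below is a minimal illustrative sketch of that registration step using a standard pinhole camera model; it is not code from this thesis, and the focal length and principal point values are arbitrary assumptions chosen for a 640x480 image.

```python
# Illustrative sketch (not from the thesis): registering overlay graphics
# to a tracked object. Given the object's estimated pose (rotation R,
# translation t) and assumed camera intrinsics (focal length f, principal
# point cx, cy), each 3D point on the object maps to the 2D pixel where
# overlay graphics should be drawn so they stay "attached" to the object.

def project_point(p, R, t, f=500.0, cx=320.0, cy=240.0):
    """Project object-frame point p through pose (R, t) into pixel coords."""
    # Transform the point into camera coordinates: X = R @ p + t
    X = [sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3)]
    # Perspective divide onto the image plane (principal point at cx, cy)
    return (f * X[0] / X[2] + cx, f * X[1] / X[2] + cy)

# Identity rotation, object two units in front of the camera:
R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
t = [0.0, 0.0, 2.0]
print(project_point([0.0, 0.0, 0.0], R, t))  # object origin -> (320.0, 240.0)
```

As the pose (R, t) is re-estimated each frame, re-projecting the overlay's anchor points through the new pose is what makes the graphics appear solidly fixed to the moving object.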