Statement of Research Interests
Sumit Basu

sbasu@sumitbasu.net
http://www.media.mit.edu/~sbasu
Post-Doctoral Researcher, Microsoft Research
PhD (September 2002), MIT Department of Electrical Engineering and Computer Science
Thesis Advisor: Professor Alex (Sandy) Pentland, MIT Media Laboratory

My research objective is simple: I want to play with audio. I want to take auditory streams from the world, explore them, search through them, cut them apart, extract information from them, filter them, morph them, change them, and play them back into the ether. I want to find new and better ways to do these things and more, then teach them to others, both students and colleagues. I want to make audio useful and fun for a broad variety of communities. I want to build a myriad of interfaces, personal monitoring mechanisms, professional audio tools, toys, and instruments using these methods. In short, I want to do for audio what my colleagues in the computer vision and computer graphics communities have done with images and video.

Over the years, I have worked in human-computer interfaces, computer vision/graphics, statistical modeling/machine learning, and of course computer audition. When I began my graduate studies, I was initially drawn to computer vision and Sandy Pentland's group at the MIT Media Lab because of their strong sense of play: they were extracting interesting meta-information from visual streams, such as the location of a user's head and hands, and using it for interactive applications like playing with a virtual dog. I joined their efforts and spent several years working on computer vision and interactive vision systems. However, after this time, I felt myself begin to drift towards audio. I sensed how much richness there was in auditory streams and what amazing potential they had for this same sort of play. Furthermore, it seemed that the work in this area was minimal: there was no equivalent of ACM's SIGGRAPH (the major computer graphics conference/journal) or IEEE's CVPR (a prominent computer vision conference) for audio; almost everything in the audio community was heavily oriented towards speech recognition or low-level signal processing. Regardless, I decided to take my chances and play in the audio world.

Because of the winding path I've taken, my research has spanned the three areas I feel are critical to sophisticated auditory play: machine perception, human-computer interfaces, and machine learning. Machine perception gives us powerful mechanisms to detect and track low-level features, but machine learning is critical for building models and recognizing higher-level patterns from this data. Also, the results of learning can be used to guide the parameters of the low-level perceptual mechanisms. Furthermore, while analysis and detection tasks are interesting on their own, they become far more so when they can be used in a system that involves a human user. Though the real-time demands of human-computer interfaces often make the lower-level tasks more difficult, I believe they make for far more engaging and powerful systems. Things get even more interesting when these interfaces are improved by learning mechanisms: over time, the interfaces can adapt to the users' behavior or help the users adapt to a behavioral goal. For this to be successful, the system must be able to perceive the resulting changes in the users' behavior, and thus the cycle begins anew.

I've had the good fortune to work in many aspects of these areas. I began doing projects in image processing/computer vision in 1994, when I was working on image enhancement algorithms with hyperacuity sensing as an undergraduate intern at Xerox PARC. Since then, I have worked on optical flow regularization for 3D head tracking, finite element models (as physical priors for deformable meshes), maximum likelihood tracking of deformable meshes, mesh-based smoothing, and more. My work in machine audition began as an undergraduate researcher at MIT working on speaker identification/clustering; as a graduate student I have worked on beamforming, pitch estimation, speech detection, source localization, speaking rate estimation, speaking style characterization, and conversational feature extraction. My interface projects began at PARC in 1993, developing applications for the ParcTAB (a pre-Palm Pilot handheld). I have since worked on the ALIVE (Artificial Life Interactive Virtual Environment) system, audio interfaces for wearable computers, "smart" headphones, the Facilitator Room project, analysis-augmented teleconferencing, the Sonovar system for sound recombination/performance, and now video and audio browsing. My machine learning background also began at PARC, where I was exploring handwriting recognition algorithms for Unistrokes (a precursor to the Palm Pilot's handwriting system). Since then I have worked on learning physics for deformable meshes, various applications of dynamic programming, HMMs, belief propagation (both exact and approximate) in dynamic Bayesian networks (DBNs), and more. Many of these projects are described in more detail at http://web.media.mit.edu/~sbasu/projects.html.

My thesis is a good example of what I consider auditory play. It was inspired one night in 1998 when I was out having dinner and overheard a nearby couple speaking in another language. Though I had no idea what they were saying, it was clear that they were probably on one of their first few dates. This made me wonder what else we could figure out from only the tone of voice and the pattern of interaction, and the computer scientist in me wondered what subset of this I could train a machine to understand. I dubbed the topic "Conversational Scene Analysis" and began to think seriously about what we could infer about a conversation without understanding any of the words. The first steps seemed relatively simple: finding the conversational "scenes," i.e., regions during which one person or another is dominating, finding what type of conversation is occurring, and quantifying how a person in a given conversation is acting relative to their baseline speaking style. Before I could get to these problems, though, there was an array of feature estimation and detection problems to be dealt with: who was speaking when and how, i.e., with what features (pitch, energy, speaking rate: see slide 6 at http://web.media.mit.edu/~sbasu/talks/defense.html). It was important to me that my methods could obtain these features from distant microphones as well as the cumbersome headsets of the speech community, which made many of these tasks significantly more challenging. Along the way, I also found a powerful method for determining whether two people were engaged in a conversation based only on their interaction pattern.
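To make that last point concrete: one simple way to quantify the coupling between two speakers (not necessarily the exact formulation in the thesis) is the mutual information between their binary voicing streams; people in the same conversation tend to alternate, so their on/off patterns are far from independent. The Python sketch below is illustrative only and assumes both streams are already aligned on a common frame grid.

```python
import numpy as np

def interaction_score(voicing_a, voicing_b):
    """Rough engagement measure between two binary voicing streams
    (1 = speaking, 0 = silent) sampled on the same time grid.
    Returns the mutual information (in bits) of the joint on/off
    pattern; higher values suggest the streams are coupled, i.e.,
    the speakers are likely engaged in the same conversation."""
    a = np.asarray(voicing_a, dtype=int)
    b = np.asarray(voicing_b, dtype=int)

    # Joint histogram over the four (speaking, speaking) states.
    joint = np.zeros((2, 2))
    for i, j in zip(a, b):
        joint[i, j] += 1
    joint /= joint.sum()

    pa, pb = joint.sum(axis=1), joint.sum(axis=0)
    mi = 0.0
    for i in range(2):
        for j in range(2):
            if joint[i, j] > 0:
                mi += joint[i, j] * np.log2(joint[i, j] / (pa[i] * pb[j]))
    return mi
```

In practice one would compute such a score over a sliding window so that engagement can be tracked over time rather than judged once for an entire recording.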

I then moved to characterizing conversational styles and patterns with data from the LDC CallHome database, a public repository of long telephone conversations between friends and family members. What I found was fascinating: different conversations had markedly different signatures in terms of prosodic and interaction parameters. One of my favorite examples was a conversation in which a young woman in Finland is speaking with her parents back in the States. When listening to the conversation, it's clear that she is much more excited about talking to her mother than to her father. What delighted me was that this difference was quite clear in my features as well (see slide 41). As I had hoped, it seemed possible to quantitatively characterize the different interaction styles we take on with different people. This has obvious applications in surveillance, but more interestingly, I see it becoming the basis for a very useful "social feedback" mechanism: we all know we treat different people differently, and while the differences are clear to a third party, it's difficult for us to gauge them ourselves. Having them laid out for us quantitatively can help us assess our relationships and work towards improving them.
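As a rough illustration of what such prosodic and interaction parameters might look like in code (these particular feature choices are hypothetical, not the thesis feature set), here is a small Python sketch that computes a per-speaker style vector from binary voicing streams and a pitch track:

```python
import numpy as np

def style_features(voicing_self, voicing_other, pitch_self, frame_rate=100):
    """Illustrative per-speaker conversational style features.
    voicing_*: binary arrays (1 = speaking) on a common frame grid.
    pitch_self: pitch in Hz per frame, 0 where unvoiced.
    frame_rate: frames per second (assumed, e.g., 100)."""
    v_self = np.asarray(voicing_self, dtype=bool)
    v_other = np.asarray(voicing_other, dtype=bool)
    pitch = np.asarray(pitch_self, dtype=float)
    voiced = pitch[v_self & (pitch > 0)]

    # Turn lengths: runs of consecutive speaking frames.
    runs, run = [], 0
    for v in v_self:
        if v:
            run += 1
        elif run:
            runs.append(run)
            run = 0
    if run:
        runs.append(run)

    return {
        "speaking_fraction": float(v_self.mean()),
        "overlap_fraction": float((v_self & v_other).mean()),
        "mean_turn_sec": (float(np.mean(runs)) / frame_rate) if runs else 0.0,
        "pitch_mean_hz": float(voiced.mean()) if voiced.size else 0.0,
        "pitch_std_hz": float(voiced.std()) if voiced.size else 0.0,
    }
```

Comparing such vectors for the same person across different conversational partners is one way the "social feedback" idea above could be made quantitative.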

Furthermore, I found that the interaction patterns for this database fell into a continuum of categories which we could use to characterize and browse conversations (slide 43). I believe this type of categorization is critical if we ever wish to browse the hours, days, and months of audio from our lives, a problem I am now working on at Microsoft Research under the title "Personal Audio". Since social interactions are key to our existence, I feel that our conversations form a large part of the story of our lives. As a result, I'm investigating the possibilities of recording all audio on our bodies all the time and then browsing back through it using my conversational scene analysis methods, both as a means of retrieving information and as a way of keeping a diary of our lives. I'm also investigating the possibilities of using this kind of information as a health monitoring mechanism for manic-depressive patients. Along with Harvard Medical School student Vikram Kumar and Massachusetts General Hospital doctor/professor Roy Perlis, I'm involved in a drug study to determine whether conversational features can be used to assess a patient's mental state. If this is successful, it could drastically increase the effectiveness of treatments, since patients could be monitored on a daily or hourly basis, allowing fine-tuning of medications and intervention. It could also greatly reduce the cost of mental health care by reducing the number of doctor visits. Finally, it could help patients take control of their own illnesses by providing them with a personal feedback tool, much as cholesterol meters and blood sugar monitors have done for patients with high cholesterol and diabetes.
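Returning to the categorization and browsing idea above: one simple (if coarser than a true continuum) way to obtain browsable categories is to cluster whole conversations by their style and interaction feature vectors. The sketch below is hypothetical and uses scikit-learn's k-means purely for illustration; the feature vectors could be, for example, stacked outputs of a function like style_features above, one row per conversation.

```python
import numpy as np
from sklearn.cluster import KMeans

def categorize_conversations(feature_vectors, n_categories=5):
    """Group conversations by their feature signatures so a listener
    can browse by category rather than by timestamp."""
    X = np.asarray(feature_vectors, dtype=float)
    # Normalize each feature so no single one dominates the distance.
    X = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-9)
    model = KMeans(n_clusters=n_categories, n_init=10, random_state=0)
    return model.fit_predict(X)  # one category index per conversation
```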

For my future work, I see a broad variety of topics and application areas. The first is smart/aware environments, where the home, office, and car are transformed into active spaces that can analyze/help/guide the user. I have a strong belief that if an environment could understand the types of interactions taking place among its inhabitants, it could do a better job of acting appropriately, and perhaps more importantly, assess long-term trends in our behavior. This area also includes distance learning and advertising kiosks, where it is important to assess the reactions/interactions of a group of people without seeing them firsthand. The next is wearable computing, an exceedingly rich framework for first-hand perceptual sensing and for private user interfaces that assess and use this sort of information. The personal audio work falls directly into this area. The third area is communication channels (cellphones, videoconferencing, conference rooms, etc.), which also allow for rich short- and long-term sensing, as well as the opportunity to give users feedback about their conversational state (do they sound excited, bored, depressed, confrontational?) and to mediate the conversation. Fourth, I am also interested in medical applications such as the mental state monitoring I described above: if we have rich sensory information about people on a 24-hour basis, both from wearables and aware environments, how can we use it to better track and improve people's health? I believe that there are many aspects of our health that are difficult to diagnose in a spot check at the doctor's office, but may be easy to find given months of behavioral data or even hours of data from an intimate (i.e., wearable) perspective. Finally, I want to use the signal processing and pattern recognition techniques I work on to develop new and interesting forms of musical expression. I have been composing and performing for a while, and more recently my experimental work with Brian Clarkson on the Sonovar project (see http://www.media.mit.edu/~sbasu/music.html) has led me to a number of new musical projects. Many of the features we can extract from sound are beautiful in and of themselves, both visually and auditorily, and I look forward to developing new ways to use these new brushes on a musical canvas.

So why do I call it play? Well, to put it simply, it's not what the traditional auditory researcher in computer science "should" be interested in. Whenever I give a talk to my colleagues in the speech recognition area, they are very interested in the techniques and results, but are utterly amazed when I start playing samples from the LDC databases to illustrate the conversational patterns. Several times I have heard people say, "Wow, I never actually listened to that data," even though they run their code on it on a daily basis. To me, this is symptomatic of a disturbing trend in the auditory community: researchers are so oriented towards speech recognition and word error rates that they have become removed from the richly textured audio streams of the real world. My work is about going back to those streams, appreciating them for what they are, and harnessing their power in ways that may have nothing to do with speech recognition. Some of these results may lead to important applications for health monitoring, personal feedback, distance education, music, and human-computer interfaces, but some of them will simply be play. And that's just fine with me.