[best of THE WEB] Zhu Liu and Michiel Bacchiani

TechWare: Mobile Media Search Resources

Digital Object Identifier 10.1109/MSP.2011.941095
Date of publication: 15 June 2011

Please send suggestions for Web resources of interest to our readers, proposals for columns, as well as general feedback, by e-mail to Dong Yu ("Best of the Web" associate editor) at [email protected].

In this issue, "Best of the Web" focuses on mobile media search, which has enjoyed rapid growth in consumer demand and hence receives a lot of attention from content providers and device manufacturers. The combination of a large increase in computational power and an always-on broadband connection available everywhere at all times makes for an ideal device for all our social, business, and entertainment needs. In addition, such devices provide a wide array of multimedia sensors such as a high-quality camera, a touch screen, audio-capturing abilities, and a global positioning system (GPS). This makes the devices not only suited for multimedia consumption but also a powerful platform for multimedia content generation. The relatively small form factor of the mobile devices in comparison to desktop machines also brings challenges, as it is much more difficult to provide textual input to the device. On the other hand, the wide availability of the multimedia sensors makes additional input modalities more natural, and many applications make good use of the added opportunities this provides. As a result, the importance of multimedia search has risen considerably given the explosion of devices attempting to locate content and the explosion of content generation itself. In addition, the spectrum of the problem of multimedia search has widened given the additional input modalities now commonly available through the sensor-rich clients.

This column first reviews the recent progress in the field of mobile media search and then discusses the enabling media indexing and query technologies. Some useful online links and resources will be listed for the interested readers for further research. Unless otherwise noted, the resources are free.

MOBILE MEDIA SEARCH
Not being mined, a diamond is simply an ordinary stone buried in the dirt. That is also true for multimedia content. With the growing availability of low-cost multimedia capture devices (video, image, and sound) and the availability of easy-to-use video editing software, shooting a video clip and publishing it online is no longer a complicated task. For example, in the last year, video material uploaded to the YouTube video sharing service has increased from about 20 h/min to about 35 h/min. This explosion of content makes the problem of multimedia search both more pressing and more challenging. As a result, the area of multimedia search has received a lot of attention as it moved from a text-only retrieval problem to a much richer and multifaceted problem. It brought in several other research fields like speech recognition, image/video processing, natural language understanding, computer vision, and music identification to better understand the content being indexed. In addition, the group-based content generation and consumption has added user signals that can be used in retrieval (e.g., a video might "go viral"). And the many sensor-rich clients make the use of multimodal input a more realistic usage setting.

Users might use speech to input a text query, or they might use their cameras to input an image-based query. Touch screens make multimodal consumption of complex results an obvious presentation method. The swift change in the problem setting, the size and nature of the content indexed, and the proliferation of devices used to engage in the content have created a new and exciting field of multimedia search.

The momentum of this technology is not only evident in the many products produced by industry in this space, it is also evident in the academic community. Recently, the "Mobile Media Search: Has Media Search Finally Found Its Perfect Platform?" panel series at the ACM International Multimedia 2009 Conference and ICASSP 2009 addressed this field. In these two panels, leading experts in the field from both industry and academia shared their opinions on mobile media search.

In the following two sections, we introduce two dimensions of mobile media search applications and systems: multimodal query interfaces and multimedia retrieved results.

MULTIMODAL QUERY IN MOBILE MEDIA SEARCH
The dominant search interface, both on the desktop and on mobile devices, is based on text-query input. Such queries might be keywords, key phrases, or natural language questions. Alternatively, some search engines provide a hierarchical categorization of the available content and provide the users an interface to navigate that hierarchy. Contrary to desktops, input of a text query can be quite challenging on a mobile device due to its small form factor. On the other hand, unlike desktops, mobile devices are equipped with a large array of sensors, making alternate input modalities a widely available option. This section focuses on the alternate interfaces mobile devices provide. Specifically, we highlight voice-based, music-based, image-based, and multimodal-based query input methods.

VOICE-BASED SEARCH
A major shortcoming of mobile devices is the effort needed to input a textual query to the device. Several companies are addressing this issue by providing a speech recognition-based solution. This in itself is a hard speech recognition problem. Queries have a very large vocabulary. They tend to be short, meaning that a language model provides a flat prior to the search problem. The user population of this technology is very large, and the most common use of the technology is on the go, likely in a noisy environment like a city street or an airport. The application also provides modeling opportunities, as the query context might be available from the text input location (the text input field and surrounding Web page), and the profile of the user is likely consistent since most mobile devices are owned and operated by a single user.
■ Vlingo [www.vlingo.com (mobile app)]: As a voice recognition app, Vlingo takes voice commands to send e-mail and text messages, dial the phone, update Twitter and Facebook statuses, and search for information on the Web.
■ Google [www.google.com/mobile/voice-search (integral to Android and as an app)]: The app allows users to input search queries by voice. In addition, it allows dictation of a text message and voice-controlled navigation and calling.
■ Tellme/Microsoft [www.microsoft.com/en-us/Tellme (mobile app)]: Users are able to use this app to make a call, open an application, and search the Web.
■ Dragon [www.dragonmobileapps.com (mobile app)]: The Dragon dictation app allows users to speak and instantly see the message, e-mail, or social network update. The Dragon search app searches a variety of top Web sites using spoken queries.

MUSIC-BASED SEARCH
Often while on the go, we hear a catchy tune or a song that we recognize but can't quite place. This presents a search problem in a modality that isn't covered well by classical text-based information retrieval. One could query based on part of the lyrics, but no such option exists for instrumental music, and for many songs the lyrics that one might remember are not very descriptive. Given the availability of a microphone on all mobile devices, capturing a snippet of the song becomes a possibility, and several companies provide a music-based query search. In fact, improvements in the robustness of the technology allow users in some cases to simply hum the tune as the input query to the search and successfully retrieve the tune in question.
■ Shazam [www.shazam.com (mobile app)]: Shazam identifies recorded music by listening to the song for about 5–7 s. After recognizing the song, it returns the song name, artist, album, a link to buy the song, and a YouTube video if available.
■ SoundHound [www.soundhound.com (free and paid mobile apps)]: Similar to Shazam, SoundHound also allows users to search music by singing/humming the song or speaking a title or band name. Lyrics, artist, and other relevant information are retrieved.
■ MusicID [musicid2.com (paid mobile service)]: MusicID identifies music by hearing it. Additional features include viewing lyrics and artist biographies and downloading songs.

IMAGE-BASED SEARCH
Similar to music-based search, some queries are hard to express in text but easy to express in the form of an image. Artwork such as paintings or album and book covers is easily captured with a camera but much harder to describe accurately in text. In other cases, a visual can directly encode information, such as universal product code (UPC) or quick response (QR) barcodes or written text, that can be automatically analyzed and transformed to a query. A barcode read might lead to a URL or a product search. A text snippet might be a query in itself or be the input to machine translation. In other cases, the image-based query can be a complex scene such as a photo of a product or a characteristic building. Several companies provide image-based query processing as apps on mobile devices; since these devices are universally equipped with high-quality cameras, image capture is an easily accessible modality to feed such searches. (A short illustrative sketch of the barcode-to-query step follows the list below.)
■ Google [www.google.com/mobile/goggles (mobile app)]: The Goggles app uses camera input to search different kinds of objects and places, including text, landmarks, books, contact info, artwork, wine, and logos.
■ SnapTell [www.snaptell.com (mobile app)]: The SnapTell app finds information about any book, DVD, CD, or video game using a picture of its cover or the UPC/European article number (EAN) barcode image.
■ Kooaba [www.kooaba.com (mobile app)]: The kooaba visual search app recognizes media covers, including books, CDs, DVDs, games, newspapers, and magazines, and provides relevant product information.
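To make the barcode-to-query step concrete, here is a minimal sketch in Python. It is not taken from any of the apps above; it assumes an OpenCV build that includes cv2.QRCodeDetector (OpenCV itself is listed under "Media Search Technologies" below), and the file name and search URL template are placeholders.

# Minimal sketch: map a camera snapshot to a text query or URL via a QR-code read.
# Assumes an OpenCV build that ships cv2.QRCodeDetector; the file name and the
# search URL template are placeholders, not part of any product in this column.
import urllib.parse

import cv2


def image_to_query(image_path):
    """Decode a QR code in the image and turn its payload into a query or URL."""
    image = cv2.imread(image_path)
    if image is None:
        raise IOError("could not read " + image_path)

    payload, _points, _ = cv2.QRCodeDetector().detectAndDecode(image)
    if not payload:
        return ""  # no code found; a real app would fall back to visual matching

    if payload.startswith(("http://", "https://")):
        return payload  # the barcode encodes a URL that can be opened directly
    # otherwise treat the decoded text as a plain search query
    return "https://www.example.com/search?q=" + urllib.parse.quote(payload)


print(image_to_query("snapshot.jpg"))

UPC/EAN barcodes, as used by SnapTell, would need a dedicated one-dimensional barcode decoder, but the same payload-to-query mapping applies.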
MULTIMODAL-BASED SEARCH
Finally, some queries are most naturally expressed by a combination of modalities. Location is easily expressed by a GPS fix or by gesturing on a map, whereas a named location is best described by a text query. Furthermore, in cases where multiple locations are the results of a query, a gesture to pinpoint a particular result among the set will be a natural choice. Various companies have provided map-based search applications that combine text- and gesture-based query inputs with location-related search results. (A small illustrative sketch of such a combined query follows the list below.)
■ AT&T (www2.research.att.com/~johnston): This link introduces the concept of multimodal interfaces and describes some early work of Michael Johnston at AT&T in this field.

■ Google Maps for Mobile (www.google.com/mobile/maps): Provides detailed road and satellite maps, possibly with driving directions and location of business listings on the map as markers. Search input is in the form of text queries or speech. The output is consumed mainly by gesture to move the map or get details for a marked business. Driving directions can also produce speech output.
■ AT&T/YP [www.yellowpages.com/products/ypmobile (mobile app)]: The YP app provides voice- and map-based search for the users to find closer results from millions of listings nationwide. Search results can be shared via text, e-mail, Facebook, and Twitter.
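As promised above, here is a small, purely illustrative sketch of how the modalities described in the "Multimodal-Based Search" section might be bundled into a single request. The class and field names are hypothetical and do not correspond to any of the products listed.

# Hypothetical sketch of a multimodal query that bundles the modalities
# discussed above; class and field names are illustrative only.
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class MultimodalQuery:
    text: Optional[str] = None                       # typed or spoken keywords
    location: Optional[Tuple[float, float]] = None   # (latitude, longitude) from GPS
    map_tap: Optional[Tuple[float, float]] = None    # point the user gestured at on the map
    selected_result: Optional[str] = None            # gesture-based pick from a prior result set

    def is_empty(self):
        return not any((self.text, self.location, self.map_tap, self.selected_result))


# Example: a spoken query refined by the device's GPS fix and a tap on the map.
query = MultimodalQuery(text="coffee shop",
                        location=(40.7128, -74.0060),
                        map_tap=(40.7130, -74.0055))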
[Figure: Mobile search. Cartoon by Tayfun Akgul ([email protected]).]

MULTIMEDIA CONTENT IN MOBILE MEDIA SEARCH
Multimedia content consumption by mobile users is growing exponentially. According to the CISCO Global Mobile Data Traffic Forecast (http://www.cisco.com/en/US/solutions/collateral/ns341/ns525/ns537/ns705/ns827/white_paper_c11-520862.pdf), mobile video traffic will exceed 50% of mobile network traffic for the first time in 2011. In light of this trend, major Web search engines such as Google, Bing, Yahoo, and Baidu have significantly boosted their support for multimedia content search on mobile devices. Support is provided either through customized Web portals or dedicated mobile applications.

Content creators like The New York Times, NBC, ABC, and others have likewise reached out to mobile consumers by supplying native mobile applications for easy access to their content. Within these applications, users can find news articles as a mix of text, photos, and audio/video. Alternatively, for larger pieces of content like audio books, podcasts, or feature-length movies, desktop media management software might be used to acquire the content, synchronize it with the mobile device, and then make the media available for mobile consumption. Examples of such applications are iTunes for iPhone/iPod and Zune for Windows-based devices.

The primary types of multimedia that mobile users are looking for are audio/music and image/video. In this section, we review a few mobile applications that are specialized in these categories.
■ Pandora [www.pandora.com (free registration required)]: Based on a seed of the user preferences for some artists/songs, the application uses collaborative filtering built on the user population to create a user-tailored inventory of music.
■ Cross Forward [www.crossforward.com (mobile app)]: The audiobooks app offers more than 3,500 classics for free. The paid version has a growing collection of premium audiobooks. Many of the recordings are taken from the LibriVox project (librivox.org), where volunteers record chapters of books and make them available in the public domain as shared audio files.
■ MIT [ocw.mit.edu/index.htm (mobile app)]: The MIT LectureHall app provides a dynamic environment for accessing MIT OpenCourseWare video lectures on the go.
■ Cooliris [www.cooliris.com (mobile app)]: Cooliris presents images and videos on an infinite three-dimensional (3-D) wall that lets the user enjoy the content without clicking from page to page. It allows the user to easily search across YouTube, Google, Flickr, Picasa, deviantART, and Yahoo. The user is able to control the interface by simple gestures and by tilting the mobile device.
■ Facebook [www.facebook.com (needs free subscription)]: The tag suggestions function in Facebook is a useful tool that makes face tagging no longer a chore. When users upload a set of photos that feature the same friends, Facebook groups the faces together and suggests names based on existing tags using face recognition software.
■ YouTube (m.youtube.com): YouTube provides an application that allows search and playback of video content on mobile devices as well as user interaction in the form of posting comments. In addition, YouTube allows users to upload videos captured with their mobile device directly from the device.
■ AT&T [www.att.com/u-verse (for U-verse service subscribers)]: The U-verse mobile app from AT&T lets U-verse customers search and download popular shows from the mobile library to watch on a mobile device.
■ Netflix (netflix.com): Netflix offers an Internet subscription service for enjoying movies and full episodes of TV shows.
■ Blinkx (m.blinkx.com): With an index of over 35 million h of searchable video, blinkx's video index is one of the largest on the Web. At blinkx, users can create personal video playlists and build a customized video wall for their blog page.

■ TRUVEO (www.truveo.com): TRUVEO indexes more than 300 million videos from thousands of sources across the Web. Users can search videos, optionally within categories such as news, TV shows, movies, music, and celebrities.

MEDIA SEARCH TECHNOLOGIES
The most commonly used features for making multimedia content discoverable are the tags and metadata associated with that content when it is made available on the Internet. For example, pictures might have captions, videos might have metadata text describing the content, or the content might be annotated with category tags. State-of-the-art search engines increasingly also rely on content-based media processing techniques to provide a richer representation of the indexed content. For example, speech recognition can provide a text signal from the spoken content of videos, and image analysis might provide richer content tagging. In this section, we briefly cover the enabling technology in four areas: speech/audio processing, image processing, video processing, and machine learning.

SPEECH/AUDIO PROCESSING
■ SOX (sox.sourceforge.net): SOX allows file format conversion among commonly used containers. In addition, it allows signal manipulations such as sample rate conversions, channel multiplexing, and filtering.
■ Wavesurfer (www.speech.kth.se/wavesurfer): Visual analysis of audio recordings, with support for navigation, playback, segmentation, labeling, and fundamental pitch and frequency (spectral) analysis.
■ Transcriber (trans.sourceforge.net): Transcriber is a comprehensive labeling tool intended for speech transcription, used to label training data for automatic speech recognition system development.
■ MaART (maart.sourceforge.net): Tools for music feature extraction for music-based search and retrieval.
■ HTK (htk.eng.cam.ac.uk): Full-featured hidden Markov model toolkit allowing the design and training of a speech recognition system from data.
■ OpenFst (www.openfst.org): Full-featured library of finite-state transducer manipulation algorithms that can be used to implement large vocabulary speech recognition systems as well as natural language processing systems.
■ SRILM [www-speech.sri.com/projects/srilm (free for academic use only)]: Toolkit for estimation of statistical language models with applications to automatic speech recognition and natural language processing.
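As a minimal illustration of how two of the tools above are typically driven from a script, the following sketch resamples a recording with SOX and estimates a trigram language model with SRILM's ngram-count. It assumes both binaries are installed and on the PATH; all file names are placeholders.

# Minimal sketch: drive SOX and SRILM's ngram-count from Python via subprocess.
# Assumes the sox and ngram-count binaries are installed and on the PATH;
# input/output file names are placeholders.
import subprocess

# Convert a recording to 16 kHz, single-channel audio, a common front-end
# step before speech recognition feature extraction.
subprocess.run(
    ["sox", "recording.wav", "recording_16k.wav", "rate", "16000", "channels", "1"],
    check=True,
)

# Estimate a trigram language model from a text corpus (one sentence per line).
subprocess.run(
    ["ngram-count", "-text", "corpus.txt", "-order", "3", "-lm", "trigram.lm"],
    check=True,
)

The resampled audio could then feed feature extraction in a toolkit such as HTK, and the resulting language model could be compiled into a finite-state representation using OpenFst-based tools.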
IMAGE PROCESSING
■ OpenCV (opencv.willowgarage.com/wiki): Library optimized for real-time computer vision applications such as image processing, image/video file I/O, image segmentation, object detection and tracking, image pyramids, geometric transforms, camera calibration, stereo image processing, and machine learning (C/C++ and Python).
■ MATLAB [www.mathworks.com/products/image (requires license)]: Tools for image processing and visualization, including image enhancement, image deblurring, feature detection, noise reduction, image segmentation, spatial transformations, and image registration.
■ ImageMagick (www.imagemagick.org): Tools for creating, editing, and converting images in various image formats.
■ PittPatt [www.pittpatt.com (the software development kit (SDK) is not free)]: The SDK implements face-finding, tracking, and recognition algorithms.

VIDEO PROCESSING
■ TRECVID (trecvid.nist.gov): Developed in light of the video retrieval task sponsored by NIST. Provides publicly available tools for shot boundary detection, event detection, and copy detection. The data sets used in the evaluations are only available to registered participants.
■ CCExtractor (ccextractor.sourceforge.net): A closed-caption extraction tool for MPEG files. Supports H.264 streams and CEA/EIA 708/608 standards. Produces closed captions in several formats like SubRip, SAMI, and others.
■ ComSkip (www.comskip.org): Analyzes MPEG files to detect commercial breaks based on silence, black frame, logo, aspect ratio change, and closed-caption information features.
■ ffmpeg (www.ffmpeg.org): Tool for transcoding and editing videos. In addition, it can be used as a front-end video parser in a video analysis system.

MACHINE LEARNING
Machine learning and pattern recognition are the key technologies that media content analysis systems rely on. The following are a few general-purpose machine learning tools that can be used as the building blocks for content analysis systems:
■ Weka (www.cs.waikato.ac.nz/ml/weka): A collection of machine learning algorithms in Java for data mining tasks. It contains data preprocessing, classification, regression, clustering, association rules, and visualization.
■ LIBSVM (www.csie.ntu.edu.tw/~cjlin/libsvm): A library for support vector machines implementing support vector classification, regression, and distribution estimation, as well as multiclass classifiers.
■ Torch (www.torch.ch): Torch is a machine learning library written in C++ and distributed under a BSD license.

AUTHORS
Zhu Liu ([email protected]) is a principal member of technical staff in the Video and Multimedia Technologies and Services Research Department of AT&T Labs–Research, USA.
Michiel Bacchiani ([email protected]) is a senior staff research scientist leading the speech indexing efforts at Google, USA.

[SP]
