
INVITED PAPER
Contextual Internet Multimedia Advertising

By Tao Mei, Member IEEE, and Xian-Sheng Hua, Member IEEE

ABSTRACT | The advent of media-sharing sites has led to the unprecedented Internet delivery of community-contributed media like images and videos. Those visual contents have become primary sources for online advertising. Conventional advertising treats multimedia advertising as general text advertising by displaying advertisements either relevant to the queries or the Web pages, without considering the potential advantages which could be brought by media contents. In this paper, we summarize the trend of Internet multimedia advertising and conduct a broad survey on the methodologies for advertising which are driven by the rich contents of images and videos. We discuss three key problems in a generic multimedia advertising framework. These problems are: contextual relevance, which determines the selection of relevant advertisements; contextual intrusiveness, which is the key to detecting appropriate ad insertion positions within an image or video; and insertion optimization, which achieves the best association between the advertisements and insertion positions so that the effectiveness of advertising can be maximized in terms of both contextual relevance and contextual intrusiveness. We show the recently developed MediaSense, which consists of image, video, and game advertising, as an exemplary application of contextual multimedia advertising. In MediaSense, the most contextually relevant ads are embedded at the most appropriate positions within images or videos. To this end, techniques in computer vision, multimedia retrieval, and human-computer interaction are leveraged. We also envision that the next trend of multimedia advertising will be game-like advertising, which leaves a deeper impression and thus can promote advertising in an interactive, compelling, and effective way. We conclude this survey with a brief outlook on open research directions.

KEYWORDS | Computer vision; contextual advertising; multimedia advertising; survey

I. INTRODUCTION

The proliferation of digital capture devices and the explosive growth of online social media (especially along with the so-called Web 2.0 wave) have led to countless private image and video collections on local computing devices, such as personal computers, cell phones, and personal digital assistants (PDAs), as well as huge and still growing public media collections on the Internet [8]. For example, the amount of digital images captured worldwide is forecast to increase from 50 billion in 2007 to 60 billion in 2011, according to IT Facts' report [44]. The most popular photo-sharing site, Flickr, reached three billion photo uploads at the end of 2008, with 3–5 million new photos uploaded daily [53], [85], while Youtube drew 5 billion U.S. online video views in July 2008 [118]. On the other hand, we have witnessed a fast and consistently growing online advertising market in recent years. Jupiter Research forecasted that online advertising spending will surge to $18.9 billion by 2010, up about 59 percent from an estimated $11.9 billion in 2005 [50]. Motivated by the huge business opportunities in the online advertising market, people are now actively investigating new Internet advertising models. To take advantage of the visual form of information representation, multimedia advertising, which associates advertisements with an online image or video, has become an emerging online monetization strategy.1

Manuscript received April 5, 2009. The authors are with the Microsoft Research Asia, Beijing 100190, China (e-mail: {tmei, xshua}@microsoft.com). Digital Object Identifier: 10.1109/JPROC.2009.2039841

1Please note that "multimedia advertising" and "advertising multimedia" are two different concepts. By multimedia advertising, we refer to the process of associating advertisements with multimedia, while advertising multimedia indicates using multimedia as the form of advertisement.


In an advertising system, there is usually an intermediary commercial ad-network entity (i.e., a service provider between the publisher and advertiser) in charge of optimizing ad selection and display with the twin goals of increasing the revenue (shared between the publisher and ad-network) and improving user experience. With these goals, it is preferable to the publishers and profitable to the advertisers to have ads relevant to media content rather than generic ads. By implementing a solid multimedia advertising strategy into an existing content delivery chain, both the publishers and advertisers have the ability to deliver compelling content, reach a growing online audience, and eventually generate additional revenue from online media.

Advertising has embarked on a dramatic evolution, which will be rapid, fundamental, and permanent. Although this evolution is still underway in terms of objectives, strategy, and solutions, we can summarize the trends of Internet advertising into two generations in terms of methodologies: conventional advertising and contextual advertising. Fig. 1 shows the evolution of advertising using text and media (such as image, video, and audio) as information carriers, respectively. Conventional text-based advertising, i.e., the first generation shown in the leftmost part of Fig. 1, is characterized by delivering ads at certain positions on Web pages which are relevant to either the queries or the Web page content. In this generation, paid search and display advertising are the main strategies, which support a "long-tail" and a "head" business model, respectively. For example, Google's AdWords [2] and AdSense [1] are successful paid search advertising platforms, while DoubleClick [24] and Yahoo! [112] have predominantly focused on the latter. From the research perspective, the rich work in the first generation has proceeded along three dimensions in terms of what the ads are matched against: 1) keyword-targeted advertising (also called "paid search advertising" or "sponsored search"), in which the ads are matched against the originating queries [49], [75], [106]; 2) content-targeted advertising (also called "contextual advertising")2 in which the ads are associated with the Web page content rather than the keywords [4], [11], [54], [83], [89]; and 3) user-targeted advertising (also called "audience intelligence"), in which the ads are driven by user profile and demography [37], comments [6], or behavior [14], [16], [22], [90].

2Please note here that "contextual relevance" mainly indicates that the relevance is derived from the entire Web content. In a broader view, contextual relevance includes not only the relevance, but also the position where the advertisements are inserted in a Web page.

Fig. 1. Trend of Internet advertising in the text and media domains. Online advertising can be summarized as two generations for each information carrier: the first generation is conventional advertising, which embeds relevant ads at fixed positions, while the second is contextual advertising, which embeds ads at automatically detected positions within the page and media.


The advertisements in the first generation are typically embedded at certain preserved blocks in the Web pages. While conventional advertising primarily embeds ads around the content, the second generation of text advertising, contextual advertising, aims to deliver relevant ads inside the content. For example, Vibrant Media [100] associates relevant ads with certain keywords or paragraphs within a Web page and embeds these ads in-text.

Using text-based advertising as a reference, we can likewise divide online advertising that uses media as the carrier into two generations, conventional and contextual advertising [40], as shown in the rightmost part of Fig. 1. Similar to text, the first generation of multimedia advertising directly applies text-based advertising approaches to media and embeds relevant ads at certain preserved positions on the Web pages. For example, Yahoo! [112] and BritePic [10] provide relevant ads around images. In the video domain, Revver [88] and Youtube [118], which subscribe to the advertising service of Google's AdSense [1], employ pre-roll or post-roll advertising (i.e., embed ads or related videos at the very beginning or end of videos), or overlay textual ads on certain video frames (e.g., on the bottom fifth of videos, 15 seconds in).

It is observed that the first generation of multimedia advertising primarily uses text rather than visual content to match relevant ads. In other words, multimedia advertising has been treated as general text advertising without considering the potential advantages which could be brought by media contents. There are very few systems in this generation that automatically monetize the opportunities brought by individual images and videos. As a result, the ads are only generally relevant to the entire Web page containing images or videos rather than specific to the images or videos it contains. Moreover, the ads are embedded at a predefined position in a Web page adjacent to the image or video, which normally destroys the visually appealing appearance and structure of the original Web page. It cannot grab and monetize users' attention aroused by these compelling contents.

It has proved unsuitable to treat multimedia advertising as general text advertising. The following distinctions between media (i.e., image and video) and text advertising motivate a new advertising generation dedicated to media.
• Beyond the traditional media of Web pages, images and videos can be powerful and effective carriers of online advertising. Compared with text, images and videos have some unique advantages which have made them the most pervasive media formats on the Internet: they are more attractive than plain text, and they have been found to be more salient than text, so they can grab users' attention instantly [29]; they carry more information that can be comprehended more quickly, just like the old saying, "a picture is worth a thousand words." Media like images and videos are now used almost as much as text in Web pages and have become powerful information carriers for online advertising. There is a new advertising model using media as the carriers for advertising, in which ads can leave a much deeper impression due to the salience of visual signals in human perception.
• The ads are expected to be locally relevant to the media content and the surrounding text, rather than globally relevant to the entire Web page. Compelling media content naturally becomes the region of interest (ROI) in a Web page. The most effective way to advertise would be to put ad information precisely relevant to the media content, hoping that an audience interested in this image or video would have similar interests in the relevant product or service advertised in it. The text in a Web page is likely either too much (e.g., using the whole page text), or too little and/or too noisy (e.g., on image and video sharing sites), to accurately describe an embedded image or video. On one hand, ads picked only based on the whole page content may not be contextually relevant enough to the image or video in that page. On the other hand, conventional ad-networks like AdSense [1] and AdWords [2] cannot work well for very sparse or noisy textual contexts. Therefore, it is reasonable to assume that the media content and its surrounding text should contribute much more to the relevance matching than the whole Web page.
• The ads are expected to be dynamically embedded at appropriate positions within each individual image or video (i.e., in-media) rather than at a predefined position in the Web page. In conventional advertising, publishers have to reserve certain predefined blocks (please refer to Fig. 1) in the Web page for advertising, be they banners or other forms of ads. Such an advertising strategy has proved intrusive to Internet users [74], as the ad blocks significantly break the page structure and visual appearance, and they are unattractive or boring to users. Now that users' attention is on the media, by embedding the ads within an image or video, the ads will in turn get more attention. Meanwhile, the publisher no longer needs to worry about the reserved blocks. By putting ads only in the non-salient portions of an image or video, the intrusiveness of display ads is reduced and the user experience is improved at the same time.


Fig. 2. Screenshots of ImageSense, VideoSense, and GameSense. The highlighted areas in (a) and (b) correspond to the ads, while the missing block in (c) corresponds to the ad position. The ads are inserted at non-salient spatial positions within images or temporal positions in video streams via visual saliency analysis. The ads are relevant to both the visual content and the Web page rather than only to the entire Web page. (a) ImageSense [78]; (b) VideoSense [79], [80]; (c) GameSense [59], [60].

Motivated by the above observations, we go one step further from the first generation of multimedia advertising and propose in this paper the second generation, which supports contextual multimedia advertising by associating the most relevant ads with an online medium (image or video) and seamlessly embedding the ads at the most appropriate positions within this medium. As shown in the rightmost part of Fig. 1, the ads are selected according to multimodal relevance, i.e., the ads are to be globally relevant to the Web page containing images or videos, as well as locally relevant to the content and surrounding text of each suitable medium. Meanwhile, the ads are embedded at the most non-salient positions within the medium. By leveraging computer vision and multimedia retrieval techniques, we are in a good position to better address the two challenges in Internet multimedia advertising, i.e., ad relevance and ad position. We demonstrate MediaSense, which includes ImageSense [78] and VideoSense [79], [80] as two exemplary applications of the new generation, dedicated to image and video, respectively. We also envision that the next trend of multimedia advertising will be game-like advertising, which leaves a deeper impression and thus can promote the advertisements in an interactive way. We show GameSense [59], [60] as an example of this next advertising platform. Fig. 2 shows screenshots of ImageSense, VideoSense, and GameSense. It is also worth noticing that many metrics have been adopted to evaluate the performance of a multimedia advertising system. We review the performance evaluation in different domains.

The rest of the paper is organized as follows. Section II provides a system overview of contextual multimedia advertising, as well as the key problems. Sections III–V address how we can leverage computer vision and multimedia retrieval techniques to solve these problems in detail. Section VI shows the implementations of ImageSense, VideoSense, and GameSense. Section VII discusses how to evaluate the performance of multimedia advertising systems. Section VIII concludes this paper and outlines future challenges.

II. SYSTEM AND KEY PROBLEMS

A. Terminology
To clearly present the system framework of the two generations of multimedia advertising, we will adopt a standard vocabulary to describe many of the common aspects and terms across each of the exemplary systems.
• Advertisement (ad): An advertisement is a public notice or announcement for calling something to the attention of the public, especially by paid announcements. In multimedia advertising, advertisements take a variety of forms, including text banner, image, video (i.e., traditional TV commercial), animation, or a combination of forms. Advertising is a form of communication that typically attempts to persuade potential customers to purchase or to consume more of a particular brand of product or service. In this paper, we mainly discuss two types of ads, i.e., image and video ads.
• Image ad: An image advertisement is a static image or banner provided by advertisers that will be inserted into or associated with a source image or video. An image ad could be a product logo [59], [60], [66], [69], [78] or a banner composed of a product logo, product name, description, and link [31], [76], [101], [118].


• Video ad: A video ad is a video clip about a product. Although video ads will be associated with a video in MediaSense [25], [80], they may take different forms (or a combination of forms), including typical commercials in TV programs [19], [25], [39], [67], as well as clips composed of text, animations, or images.
• Source media: Source media are most often produced or owned by content providers (or partially by publishers who help to distribute the contents), and may be images, videos, or audio clips captured and provided by professional photographers, videographers, or grassroots users. The advertisements will be embedded at certain positions around, within, or overlaid on the source media.
• Ad insertion point: A point/position with which one or more advertisements will be associated. An ad insertion point could be a spatial region around the medium in a Web page, a region on an image, a spot on the timeline of a video, a spatiotemporal patch in a video, or even a position outside the Web page. For example, the highlighted rectangle in Fig. 2(a), the yellow spots on the timeline in Fig. 2(b), and the center missing block in Fig. 2(c) are ad insertion points.
• Contextual advertising: Contextual advertising refers to the placement of commercial advertisements within the content of a generic Web page based on the similarity between the content of the target page and the ad description provided by the advertiser [11], [83]. If the advertisements are associated with media, then this type of advertising is contextual multimedia advertising.
• Multimodal relevance: A modality is defined as any source of information about media contents that can be leveraged algorithmically for analysis [52]. In multimedia advertising applications, this information can be decomposed into various modalities which measure low-level aspects of visual data (such as the color and textures in an image, the motions in a video sequence, and the tempos in an audio stream), mid-level visual concepts or object categories (such as people, location, and objects), as well as high-level textual descriptions associated with the data (such as user-provided tags on an image or video, transcripts associated with a video stream, or automatically recognized captions on a video frame). The relevance can be measured by the similarity between two media files in terms of certain types of modalities; for example, there are textual, visual, and aural relevance. Accordingly, the multimodal relevance is a combination of the results from the various modalities between two media.

B. General Framework
Fig. 3 shows the general framework of multimedia advertising. It also summarizes the distinctions between the two generations of multimedia advertising, as mentioned in Section I. Given an online medium, which could be an image within a Web page, a collection of image search results, a video sequence, or even an audio stream, a list of candidate ads is selected from an ad inventory and ranked according to the relevance between the source medium and the ads.

Fig. 3. A general framework of multimedia advertising. Conventional advertising only focuses on text-based ad relevance matching, while contextual advertising like MediaSense [40] considers not only textual relevance but also multimodal (textual, visual, and aural) relevance, as well as automatic detection of appropriate ad insertion points within media.


The multimodal relevance can be derived from textual information, such as user-provided tags, video transcripts, and automatically recognized captions; it can also be derived from the low-level similarity in terms of color and textures in images, camera or object motions in video sequences, or tempo and beat in audio streams, as well as at the semantic level in terms of automatically detected visual concepts and object categories [27], [77], [84], [95]. Meanwhile, a set of candidate ad insertion points is automatically detected on the basis of spatiotemporal visual saliency analysis [15], [64], [69], [73], [78]–[80], [103]. Intuitively, the ads would be inserted into the most non-salient positions within the media contents so that the ads would be nonintrusive to the viewers. Moreover, the ads associated with the media would be the most relevant to the contents. By minimizing contextual intrusiveness and maximizing contextual relevance simultaneously, the effectiveness of advertising can be maximized [74]. Given a list of candidate ads and ad insertion points, an optimization-based association module will associate each candidate ad with the best insertion point.

We can see from Fig. 3 that conventional multimedia advertising only focuses on ad relevance matching, referred to as the "text-based ad matching" and "candidate ad list" modules, while the new generation of advertising not only considers "multimodal ad matching" for ad selection but also investigates the problem of "ad insertion point detection." Finally, given a set of candidate ads and ad insertion points, the "optimization-based ad delivery" module will associate the most relevant ads with the most appropriate insertion points by maximizing the overall contextual relevance while minimizing the overall intrusiveness.
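To make the modules in this framework concrete, the following minimal sketch (ours, not code from MediaSense; all class, function, and parameter names are illustrative) shows the data flow: each candidate ad is scored against each candidate insertion point by a weighted combination of per-modality relevance scores and penalized by the point's intrusiveness, which is the quantity the later association step optimizes.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Ad:
    ad_id: str
    features: Dict[str, object]      # textual / visual / conceptual descriptors


@dataclass
class InsertionPoint:
    point_id: str
    intrusiveness: float             # from saliency or discontinuity analysis
    features: Dict[str, object]


def multimodal_relevance(ad: Ad, point: InsertionPoint,
                         scorers: Dict[str, Callable[[Ad, InsertionPoint], float]],
                         weights: Dict[str, float]) -> float:
    """Linear combination of per-modality relevance scores."""
    return sum(weights[m] * scorers[m](ad, point) for m in scorers)


def rank_pairs(ads: List[Ad], points: List[InsertionPoint],
               scorers: Dict[str, Callable[[Ad, InsertionPoint], float]],
               weights: Dict[str, float],
               lam: float = 1.0, mu: float = 1.0) -> List[Tuple[Ad, InsertionPoint, float]]:
    """Score every (ad, point) pair by relevance minus weighted intrusiveness."""
    scored = [(a, p, lam * multimodal_relevance(a, p, scorers, weights) - mu * p.intrusiveness)
              for a in ads for p in points]
    return sorted(scored, key=lambda t: t[2], reverse=True)
```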
C. Key Problems
In general, there are four key problems in an effective contextual multimedia advertising system: contextual relevance, contextual intrusiveness, insertion optimization, and rich displaying.
• Contextual relevance: Which ads should be selected for a given image or video? Since relevance increases advertising revenue [57], [74], contextual multimedia advertising performs multimodal relevance matching by considering both global textual relevance from the entire Web page and local relevance from the textual information associated with the media content, as well as low-level visual and high-level semantic similarity between the ads and ad insertion points.
• Contextual intrusiveness: Where should the selected ads be inserted so that the contextual intrusiveness is minimized? Ad position will certainly affect user experience when an image or a video is viewed [74]. In contextual multimedia advertising, the selected ads are to be inserted at the most non-intrusive positions within the media.
• Insertion optimization: Given a ranked list of candidate ads and ad insertion points, how should each ad be associated with the best insertion point? The objective is to maximize the effectiveness of advertising by simultaneously minimizing the contextual intrusiveness to viewers and maximizing the contextual relevance between the ads and media.
• Rich displaying: How are the selected ads displayed or rendered? Rich displaying includes the duration of each ad, the way the ad is rendered, the support of interaction between the ad and users, and the rich information associated with the ads. An effective display keeps the advertising from being an intrusive experience for users.
In this paper, we mainly focus on the first three problems, while leaving the fourth as an open issue. The comparisons between conventional and the proposed contextual multimedia advertising, in terms of contextual relevance and contextual intrusiveness, are listed in Table 1.

Table 1 Comparisons Between Conventional and Contextual Multimedia Advertising, in Terms of Contextual Relevance and Contextual Intrusiveness


III. CONTEXTUAL RELEVANCE

One of the fundamental problems in contextual advertising is "relevance": irrelevant ads detract from user experience, while relevant ads increase the probability of reaction [57], [74]. By contextual relevance, we refer to the fact that ads are expected to be relevant both to the entire source media and to the local ad insertion points within the media. The contextual relevance for each pair of ad and ad insertion point is a multimodal relevance consisting of textual, visual, conceptual, and user relevance. In this section, we review the relevance from these different modalities.

A. Textual Relevance
The major effort in advertising has focused on the text domain. There exists rich research in the literature on textual relevance that can be leveraged or applied to multimedia advertising. The literature review on text relevance in this paper will focus on two key issues: 1) ad keyword selection, i.e., how to pick suitable keywords, Web pages, or images for advertising so that the relevance can be improved, and 2) ad relevance matching, i.e., how to select relevant ads according to a set of selected keywords or a Web page.

Typical advertising systems analyze a Web page or query to find prominent keywords or categories, and then match these keywords or categories against the words for which advertisers bid. If there is a match, the corresponding ads will be displayed to the user through the Web page. Yih et al. have studied a learning-based approach to automatically extracting appropriate keywords from Web pages for advertisement targeting [117]. Instead of dealing with general Web pages, Li et al. propose a sequential pattern mining-based method to discover keywords from a specific broadcasting content domain [58]. In addition to Web pages, queries also play an important role in paid search advertising. In [94], the queries are classified into an intermediate taxonomy so that the selected ads are more targeted to the query. The works in [56], [59], [60], [78]–[80] present methods for detecting potentially suitable images or videos in a Web page for advertising, by analyzing the structure of the Web page and the visual appearance of the images or videos.

As we have mentioned in Section I, research on ad relevance has proceeded along three dimensions in terms of what the ads are matched against: 1) keyword-targeted advertising ("paid search" or "sponsored search"), 2) content-targeted advertising ("contextual advertising"), and 3) user-targeted advertising ("audience intelligence"). Although the paid search market has developed more quickly than the contextual advertising market, and most textual ads are still characterized by "bid phrases," there has been a drift to contextual advertising as it supports a long-tail business model [57]. For example, a recent work examines a number of strategies to match ads to Web pages based on extracted keywords [89]. A follow-up work applies Genetic Programming (GP) to learn functions that select the most appropriate ads, given the contents of a Web page [54]. To alleviate the problem of exact keyword matching in conventional advertising, Broder et al. propose to integrate semantic phrases into traditional keyword matching [11]. Specifically, both the pages and ads are classified into a common large taxonomy, which is then used to narrow down the search of keywords to concepts. Most recently, Hu et al. propose to predict user demographics from browsing behavior [37]. The intuition is that while user demographics are not easy to obtain, browsing behaviors indicate a user's interest and profile.

When applying textual relevance to the visual domain, in addition to the techniques discussed above for the text domain, the characteristics of media should be taken into account from the following perspectives: 1) The entire text of the Web page is too noisy and broad to describe the media embedded in the page, while the surrounding texts which are spatially close to the media can better describe the contents and lead to better ad relevance. 2) The surrounding texts associated with an image or video are sometimes too few for selecting relevant ads. The hidden texts (e.g., expanded words, visual concepts, object categories, or events) which are automatically recognized from visual signals can more precisely describe the media contents. Using the surrounding and hidden texts can yield better textual relevance.
Given a Web page containing images or videos, it is desirable to first segment it into several blocks with coherent topics, detect the blocks with suitable images or videos for advertising, and extract the semantic structure, such as the surrounding texts, from these blocks. The Vision-based Page Segmentation (VIPS) algorithm [12], [13] is adopted to extract the surrounding texts associated with a medium in [59], [60], [78], [79]. The VIPS algorithm makes full use of the page layout structure. It first extracts all the suitable blocks from the Document Object Model (DOM) tree of the HTML, and then finds the separators between these blocks. Based on these separators, a Web page can be represented by a semantic tree in which each leaf node corresponds to a block. In this way, contents with different topics are distinguished as separate blocks in a Web page. Fig. 4(a) and (b) illustrate the vision-based structure of a sample page. It is observed that this page has two main blocks and the block named "VB-1-2-1-1" is detected as the video block. Specifically, after obtaining all the blocks via VIPS, the images or videos which are suitable for advertising in the Web page are carefully selected. Intuitively, the images or videos with poor visual quality, or belonging to advertisements (usually placed at certain positions of a page) or decorations (usually too small), are first filtered out. Then, the corresponding blocks with the remaining images or videos are selected as the advertising page blocks. The surrounding texts (e.g., title and description) are used to describe each image or video.

Based on the surrounding texts, the expansion text can be obtained by leveraging query expansion based on the user log [21], while the hidden texts are obtained by automatic text categorization [116] and video concept detection [77], [87]. Specifically, we use text categorization based on a Support Vector Machine (SVM) [116] to automatically classify a textual document into a predefined category hierarchy which consists of more than 1000 categories. The concept texts are selected by using the top concepts with the highest confidence scores in a specific ontology like LSCOM [77]. The hidden texts can complement the direct text, as the surrounding text related to a video is usually not quality-controlled. Fig. 4(c) shows these two types of texts with the corresponding probabilities derived from (a) and (b).
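VIPS relies on rendered layout information, so the following is only a rough stand-in (our sketch, not the VIPS algorithm): a plain DOM walk with BeautifulSoup (an assumed dependency) that filters out tiny, decoration-like images and pairs each remaining image with the text of its enclosing block as its surrounding text.

```python
from bs4 import BeautifulSoup  # assumed dependency

MIN_SIDE = 120  # heuristic: skip decorations / fixed-size banner images


def image_blocks_with_text(html: str):
    """Pair each sizeable <img> with the text of its enclosing block.

    A crude stand-in for VIPS-style page segmentation: the enclosing
    container is treated as the image's topic-coherent block and its
    text as the surrounding text used for ad matching.
    """
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for img in soup.find_all("img"):
        try:
            w = int(img.get("width") or 0)
            h = int(img.get("height") or 0)
        except ValueError:
            w = h = 0
        if 0 < w < MIN_SIDE or 0 < h < MIN_SIDE:
            continue  # likely a decoration or a small ad banner
        block = img.find_parent(["figure", "div", "td", "li"]) or soup.body
        text = block.get_text(" ", strip=True) if block else ""
        results.append({"src": img.get("src"), "surrounding_text": text})
    return results
```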


Fig. 4. Text extraction from a Web page. "VB" indicates a visual block. A source video is extracted from block "VB-1-2-1-1," while the textual information is extracted from block "VB-1-2-2." (a) The segmented Web page, (b) DOM tree, (c) textual information including surrounding and hidden texts.

Given the surrounding texts, the vector space model (VSM) [7] can be adopted to measure the textual relevance between a source medium and a candidate ad [78], [89]. All the texts associated with the ad are supposed to be provided or bid by advertisers. In the VSM, each textual document is represented as a weighted vector in an n-dimensional space. The weight can be computed using the product of the term frequency (tf) and the inverted document frequency (idf), reflecting the assumption that the more frequently a word appears in a document and the rarer the word is across all documents, the more informative it is. The ranking of the ads with regard to the document is computed by the cosine similarity between their surrounding texts [7].
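As a concrete illustration of the VSM matching described above, the sketch below (ours; a minimal bag-of-words implementation, not the exact weighting used in [78], [89]) ranks candidate ads by the cosine similarity between tf-idf vectors built from the medium's surrounding text and the advertiser-provided ad text.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def tfidf_vectors(docs: List[str]) -> List[Dict[str, float]]:
    """Build tf-idf vectors for a small corpus of whitespace-tokenized documents."""
    tokenized = [d.lower().split() for d in docs]
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))
    n = len(docs)
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append({t: tf[t] * math.log((1 + n) / (1 + df[t])) for t in tf})
    return vectors


def cosine(u: Dict[str, float], v: Dict[str, float]) -> float:
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def rank_ads(surrounding_text: str, ad_texts: List[str]) -> List[Tuple[int, float]]:
    """Return (ad index, similarity) pairs sorted by textual relevance to the medium."""
    vecs = tfidf_vectors([surrounding_text] + ad_texts)
    doc_vec, ad_vecs = vecs[0], vecs[1:]
    scores = [(i, cosine(doc_vec, v)) for i, v in enumerate(ad_vecs)]
    return sorted(scores, key=lambda s: s[1], reverse=True)
```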
B. Visual Relevance
Different from textual relevance, which is globally defined on the entire image or video, visual relevance is regarded as local since it is usually defined on the basis of a local ad insertion point. Visual relevance measures the local similarity between an ad insertion point and an ad in terms of visual appearance. Intuitively, the inserted ads are expected to have a similar appearance to the local insertion points, so that the viewing experience is degraded as little as possible [15], [31], [69], [76], [78], [91]. Likewise, in a video advertising scenario, the ads are expected to have a similar visual-aural style to the video content around the insertion point [79], [80], [96]. Such visual relevance can be adjusted according to different advertising strategies. For example, if the ad-network cares about the advertisers more than the viewers, it can make the visual relevance as low as possible, so that the ads attract viewers' attention as much as possible. However, which strategy for visual relevance is best for contextual multimedia advertising still remains an open issue.

In contextual in-image advertising [59], [60], [65], [78], the product logos are embedded in non-salient image blocks. The ads are assumed to have a similar appearance to the neighboring blocks around the insertion positions, so that users may perceive the ads as a natural part of the original image. To measure the visual relevance between the ad and the ad insertion position, the L1 distance in an HSV color space between the ad block and the neighboring image blocks is adopted. This distance has been widely adopted in visual retrieval and search systems [23], [77]. Instead of a global feature, local features such as scale-invariant feature transform (SIFT) descriptors [71] can be used for finding a visually exact match between a video frame and a product logo [66].

In contextual in-video (or in-stream) advertising [31], [76], [79], [80], the visual relevance is measured by a set of visually perceptual features such as motion and color, as well as high-level concept text [as shown in Fig. 4(c)]. The visual relevance is usually combined with the aural relevance, which is derived from the audio track. Specifically, the motion intensity is used to characterize motion, which is computed by averaging the frame differences within a shot [38]. The color is represented by a 16-dimensional dominant color histogram in HSV space [79]. The audio tempo is used to describe the rhythm and energy in the audio track [38]. These features have proved to be effective for describing video content in many existing multimedia applications [38], [114]. More sophisticated low-level features related to visual-aural relevance can also be applied here. The authors suggest various ways of using local visual relevance in [31], [76], [79], [80]. For example, one can use "positive" local visual relevance to keep high similarity between the video and ad content, or use it in a "negative" way to gain more attention from viewers because of the high contrast. Typically, the "positive" way is chosen since it is more natural for the viewers. For example, when viewing an online music video, users may feel that an ad with a similar music tempo does not degrade their experience. Since local relevance indicates the visual-aural similarity between a source video shot and a video ad, the set of visual and aural features is computed at the video level by averaging the features over all shots for the ad, rather than at the shot level as for the source video. The visual relevance between the insertion point and the video ad is the linear fusion of the visual similarities between the ad and the source video shots around the insertion point (i.e., before and after the shot boundary). Please note that the insertion point in the in-video advertising system is a shot boundary or a story break. Different from the in-video advertising in [76], [79], [80], the vADeo system attempts to address visual relevance by finding the same person in a video segment via face detection and recognition [96].
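A minimal version of the HSV-histogram L1 distance used above for in-image visual relevance could look like the following (our sketch; OpenCV and NumPy are assumed dependencies, and the 16-bin hue histogram is illustrative rather than the exact dominant-color feature used in [78], [79]).

```python
import cv2          # assumed dependency
import numpy as np


def hsv_hist(block_bgr: np.ndarray, bins: int = 16) -> np.ndarray:
    """Normalized 16-bin hue histogram of an image block (BGR input, as OpenCV loads)."""
    hsv = cv2.cvtColor(block_bgr, cv2.COLOR_BGR2HSV)
    hist = cv2.calcHist([hsv], [0], None, [bins], [0, 180]).ravel()
    total = hist.sum()
    return hist / total if total > 0 else hist


def visual_relevance(ad_block: np.ndarray, neighbor_blocks: list) -> float:
    """Higher when the ad looks similar to the blocks around the insertion point.

    Computed as 1 minus half the average L1 distance between normalized
    histograms (the L1 distance of two normalized histograms lies in [0, 2]).
    """
    ad_h = hsv_hist(ad_block)
    dists = [np.abs(ad_h - hsv_hist(b)).sum() for b in neighbor_blocks]
    return 1.0 - 0.5 * (sum(dists) / len(dists)) if dists else 0.0
```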


C. Conceptual Relevance
Given the hidden texts, which can be obtained by the methods described in Section III-A, a probabilistic model [79], [80], [114] is adopted to measure the conceptual relevance between a source medium and a candidate ad. The probabilistic model represents the categories or concepts in a hierarchical tree, in which each node corresponds to a category or concept. The relevance between two documents is the sum of the relevance between all possible pairs of categories or concepts. The relevance between two nodes in this tree is measured by their probabilistic distance.

Another way to compute conceptual relevance is to represent the hidden text associated with a medium by a normalized vector, with each element corresponding to the probability of a certain category or concept [34], [36]. Then, the relevance between the ad and the medium can be measured by the L1 distance or other distance metrics between the text of the ads (provided by advertisers) and the hidden texts associated with the media. Most recently, Wu et al. propose to measure the textual relevance between two concept words via the Flickr distance, which is built upon the visual language models from a collection of images searched by the concepts [109]. The Flickr distance is a new textual relevance measure designed from the perspective of the visual domain.
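The second, vector-based formulation can be sketched as follows (ours; the concept vocabulary, the detector scores, and the normalization are illustrative assumptions, with relevance mapped to [0, 1] via the L1 distance between normalized concept distributions).

```python
from typing import Dict, List


def concept_vector(concept_scores: Dict[str, float], vocabulary: List[str]) -> List[float]:
    """Project detector/classifier confidences onto a fixed concept vocabulary, normalized to sum 1."""
    raw = [max(concept_scores.get(c, 0.0), 0.0) for c in vocabulary]
    total = sum(raw)
    return [x / total for x in raw] if total > 0 else raw


def conceptual_relevance(ad_concepts: Dict[str, float],
                         media_concepts: Dict[str, float],
                         vocabulary: List[str]) -> float:
    """1 minus half the L1 distance between normalized concept distributions (in [0, 1])."""
    a = concept_vector(ad_concepts, vocabulary)
    m = concept_vector(media_concepts, vocabulary)
    return 1.0 - 0.5 * sum(abs(x - y) for x, y in zip(a, m))
```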
D. User Relevance
With social networks becoming more and more pervasive, delivering user-targeted ads for multimedia advertising has become an emerging issue. Based on user relevance, the ads are expected to be relevant to the user profile, interest, location, click-through, historical behavior (e.g., travel traces), and so on. For example, personalized ad delivery in Interactive Digital Television (IDTV) has been a potentially hot application [51], [55], [97], [99]. Such advertising refers to the delivery of advertisements tailored to the individual viewers' profiles on the basis of knowledge about their preferences [55], current and past contextual information [51], [97], or sponsors' preferences [99]. In these systems, the users are assumed to be grouped into a set of predefined interest groups or to provide their profiles in advance, and then the ads falling into the users' interests will be delivered based on the text matching and classification techniques described in Section III-A. However, most of these systems do not study relevance in terms of visual content.

IV. CONTEXTUAL INTRUSIVENESS

Contextual relevance deals with the selection of relevant ads according to a given image or video, while contextual intrusiveness answers the question of where the selected ads are embedded. Detecting ad insertion points within a medium is the converse of traditional commercial detection in TV programs [25], [39], [67]. Similar to the visual relevance described in Section III-B, there are various strategies for finding ad insertion points via contextual intrusiveness. One way is to find the most interesting or highlighted segments in a video, or the most salient region in an image, as ad insertion points, so that the ad impression will be maximized [57], [74]. The other is the contrary, i.e., to seek the most non-salient parts as ad insertion points, so that users may not find the ads intrusive when browsing the augmented media. Finding suitable insertion points is actually a kind of trade-off between ad impression and viewing experience, which deserves a deep study on the effectiveness of advertising. Technically, contextual intrusiveness is the key to automatically detecting ad insertion points in a given medium. Traditional advertising adopts the preserved blocks in a Web page as ad positions, while in the new generation of multimedia advertising, computer vision techniques such as visual saliency detection and video content analysis can benefit the automatic detection of such positions. We next describe how to detect ad insertion points in the image and video domains via contextual intrusiveness analysis.

A. Contextual Intrusiveness in Image Domain
In contextual image advertising, the relevant ads are embedded at certain spatial positions within an image. The image ad-network finds the non-intrusive ad positions within an image and selects the ads whose product logos are visually similar or have a similar visual style to the image, so as to minimize intrusiveness and improve user experience. Specifically, in [78], the candidate ad insertion positions (usually image blocks) are detected based on the combination of an image saliency map, face detection, and text detection, while visual similarity is measured on the basis of an HSV color feature.

In ImageSense [78], given an image I which is represented by a set of blocks, a saliency map S representing the importance of the visual content is extracted by investigating the effects of contrast in human perception [73]. Fig. 5(d) shows an example of the saliency map of an image. The brighter the pixel in the saliency map, the more salient or important it is. However, the saliency map predominantly focuses on modeling visual attention while neglecting faces and embedded text (i.e., captions) in images. In fact, faces and text usually present informative content in an image. A product logo should not overlay the areas with faces or text. Based on this assumption, we can perform face [61] and text detection [17], [41] for each image and obtain the face and text areas. Then, the saliency map S is overlaid on the detected face and text areas by a max operation.


In this way, a combined saliency map C is obtained in which the value of each pixel indicates the overall salience for ad insertion. Fig. 5(b)–(d) show the face, text, and saliency maps of the original image in (a), while (e) shows the combined saliency map.

Intuitively, the ads should be embedded at the most non-salient regions in the combined saliency map C, so that the informative content of the image will not be occluded and the users may not find the ads intrusive. Meanwhile, the image block set B is obtained by partitioning image I into grids. Each grid corresponds to a block b_i and also a candidate ad insertion point. For each block, a combined saliency energy c_i (0 ≤ c_i ≤ 1) is computed by averaging all the normalized saliency energies of the pixels within b_i. As the combined saliency map C does not consider the spatial importance for ad insertion, a weight map W = {w_i} is designed to weight the energy c_i, so that the ads will be inserted into the corners or sides rather than the center blocks. Fig. 5(f) and (g) show an example of the weight map and of w_i(1 − c_i), respectively. Therefore, w_i(1 − c_i) indicates the content non-intrusiveness of block b_i for embedding an ad. As a result, the top-left block is selected as the ad insertion point in Fig. 5(h).
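The block scoring just described can be sketched as follows (ours; NumPy is an assumed dependency, the saliency, face, and text maps are assumed to be precomputed and aligned, and the corner-favoring weight map is an illustrative choice rather than the exact W used in [78]).

```python
import numpy as np


def candidate_positions(saliency: np.ndarray, face_mask: np.ndarray,
                        text_mask: np.ndarray, grid: int = 5):
    """Rank grid blocks of an image by non-intrusiveness w_i * (1 - c_i).

    saliency, face_mask, text_mask: HxW maps in [0, 1] (visual attention,
    detected face regions, detected caption/text regions).
    Returns (row, col, score) triples, best candidate block first.
    """
    combined = np.maximum.reduce([saliency, face_mask, text_mask])  # max operation
    h, w = combined.shape
    bh, bw = h // grid, w // grid
    center = (grid - 1) / 2.0
    scored = []
    for r in range(grid):
        for c in range(grid):
            block = combined[r * bh:(r + 1) * bh, c * bw:(c + 1) * bw]
            ci = float(block.mean())                 # combined saliency energy c_i
            # Illustrative weight map: 0 at the center block, 1 at the corners.
            wi = (abs(r - center) + abs(c - center)) / max(2 * center, 1.0)
            scored.append((r, c, wi * (1.0 - ci)))
    return sorted(scored, key=lambda t: t[2], reverse=True)
```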

Fig. 5. An example of detecting candidate ad positions in an image in [78]. (a) I: original image; (b) face detection result; (c) text detection result; (d) S: image saliency map; (e) C: combined saliency map; (f) W: weight map; (g) final saliency map combining (e) and (f); (h) the least intrusive candidate ad position. The brighter the pixel in the saliency maps (d) and (e), the more important or salient it is, while the brighter each block in (g), the more likely it is a candidate ad position. The highlighted rectangle in (h) is obtained from (g). We use a 5 x 5 grid for illustration in this figure.

Rather than overlaying the ads at non-salient regions within images, Li et al. propose to overlay the ads on the center of the image before the full-resolution image is downloaded [65]. Fig. 6 shows the basic idea of delivering ads on the image. From this perspective, the contextual intrusiveness depends on the time when images appear on the Web page.

Fig. 6. The ad is overlaid on the image before the full-resolution image is downloaded in [65]. The image in (a) is the original full-resolution image, while the images and ads in (b)–(d) appear over time.

B. Contextual Intrusiveness in Video Domain
The ads in contextual video advertising can be inserted in a preserved page block around the video [66], as an overlay on certain video frames [31], [76], at the story or scene breaks in a video [79], [80], or even into a spatiotemporal portion of a video [15], [62], [64], [69], [103]. The detection of in-stream ad insertion points is based on contextual intrusiveness [79], [80]. A pre-processing step is assumed to parse the source video into shots and represent each shot by a key-frame using the color-based method in [120]. Each shot boundary is naturally a candidate ad insertion point. Li et al. investigate eight factors that affect consumers' perceptions of the intrusiveness of ads in traditional TV programs [57]. Two computable measurements based on these eight factors, i.e., content discontinuity and attractiveness, are excerpted from [57]. The content discontinuity measures the content "dissimilarity" between two video shots, while the content attractiveness measures the "importance" or "interestingness" of the content within a shot. The higher the discontinuity, the more likely the corresponding insertion point is a boundary between two stories. In fact, different combinations of discontinuity and attractiveness fit the requirements of different roles. For example, it is intuitive that ads are expected to be inserted at the shot boundaries with high discontinuity and low attractiveness from the viewers' perspective. On the other hand, "high discontinuity plus high attractiveness" may be a tradeoff between viewers and advertisers. The detection of ad insertion points can then be formulated as ranking the shot boundaries based on different combinations of content discontinuity and attractiveness.

In [79], [80], the authors seek a soft measure of how suitable a shot boundary is as an ad insertion point. As shown in Fig. 7, a degree of discontinuity is assigned to each insertion point. The higher the discontinuity, the more likely the corresponding insertion point is a boundary between two stories. An improved BFMM (so-called "iBFMM") is proposed to deal with the interlaced repetitive pattern problem in BFMM and to assign each boundary a soft discontinuity value [122]. In the preprocessing step, the most similar shots are merged at different scales to eliminate interlaced repetitive patterns. In the BFMM and normalization step, the merge order is recorded and normalized as the final discontinuity.
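A simple ranking along these lines might look like the following (our sketch; the per-boundary discontinuity and attractiveness values are assumed to be precomputed, e.g., by iBFMM and an attention model, and the combination weights encode the chosen advertising strategy).

```python
from typing import List, Tuple


def rank_insertion_points(discontinuity: List[float], attractiveness: List[float],
                          alpha: float = 1.0, beta: float = 1.0) -> List[Tuple[int, float]]:
    """Rank candidate shot boundaries for ad insertion.

    A viewer-oriented strategy prefers high discontinuity (a likely story
    boundary) and low attractiveness (a less interesting moment); other
    strategies can flip the sign of beta to favor attractive moments.
    """
    scores = [(i, alpha * d - beta * a)
              for i, (d, a) in enumerate(zip(discontinuity, attractiveness))]
    return sorted(scores, key=lambda s: s[1], reverse=True)
```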


Fig. 7. Candidate ad insertion point detection in [79], [80]. Suppose there are Ns shot or scene boundaries in a video sequence; then the number of candidate points is Ns + 1 (including the beginning and the end of the video), each boundary indicating one candidate insertion point.

In general, it is difficult to evaluate how well a video clip attracts viewers' attention, since "attractiveness" or "attention" is a neurobiological concept. Alternatively, a user attention model is proposed by Ma et al. to estimate human attention by integrating a set of visual, auditory, and linguistic elements related to attractiveness [72]. Another approach to exploring the attention in a video sequence is to average static image attention over a segment of frames [73]. In [79], [80], an attention value is computed for each shot by the user attention model in [72]. The content attractiveness of an insertion point is highly related to the neighboring shots on both sides of this point. Therefore, the attractiveness of an ad insertion point is computed as a weighted average of the attention values of its neighboring shots.

Different from video advertising for general video contents in [79], [80], to make video contents more enriching, researchers have attempted to spatially replace a specific region with a product advertisement in sports videos [15], [64], [69], [103]. These regions could be locations with less information in baseball [64] and tennis videos [15], the region above the goal-mouth in soccer videos [103], or the smooth regions in the video highlights [69]. Fig. 8 shows some examples of advertisements in sports video. An online platform is also presented to measure the quality of product placement [45]. The domain-specific approaches in these applications, such as the detection of lines and less-informative regions [64], [103], are not practical in the general case, especially for online videos. Li et al. propose to find the most non-salient space-time portions in the video [62]. They formulate the problem as a Maximum a Posteriori (MAP) problem which maximizes the desired properties related to a less intrusive viewing experience, i.e., informativeness, consistency, visual naturalness, and stability.

Fig. 8. Ad insertion point detection in sports videos. Virtual contents are inserted in the smooth areas by leveraging domain-specific knowledge in sports video. (a): [69], (b): [15], (c): [64], [103].


V. INSERTION OPTIMIZATION

Given a list of ads ranked according to their contextual relevance and a list of ad insertion points ranked according to their contextual intrusiveness, how should each ad be associated with one of the insertion points so that the effectiveness of contextual advertising is maximized? As we have mentioned in Section II-C, an effective advertising system is able to maximize contextual relevance while minimizing contextual intrusiveness at the same time. This problem is formulated as a non-linear 0-1 integer programming problem (NIP) in [78], [80], with each of the above rules as a constraint.

For example, without loss of generality, the task of insertion optimization can be defined as the association of ads with insertion points in an online medium, which might be an image or a video. Suppose we have an ad database A which contains N_a ads, represented by A = {a_i}_{i=1}^{N_a}, and a set of candidate ad insertion points P which contains N_p points, represented by P = {p_j}_{j=1}^{N_p}. The contextual relevance R(a_i, p_j) between each ad a_i and point p_j can be obtained by the approaches in Section III (i.e., the linear combination of textual, visual, conceptual, and user relevance), while the contextual intrusiveness I(a_i, p_j) can be obtained as in Section IV. The objective of contextual advertising is to maximize the overall contextual relevance R(A, P) while minimizing the overall contextual intrusiveness I(A, P). The following design variables can be introduced for the problem formulation: x ∈ R^{N_a}, y ∈ R^{N_p}, x = [x_1, ..., x_{N_a}]^T, x_i ∈ {0, 1}, and y = [y_1, ..., y_{N_p}]^T, y_j ∈ {0, 1}, where x_i and y_j indicate whether a_i and p_j are selected (x_i = 1, y_j = 1) in A and P. Given the number of ads N to be inserted in a medium, the NIP formulation [9] is

$$
\max_{x,\,y} \; f(x, y) \;=\; \lambda \sum_{i=1}^{N_a} \sum_{j=1}^{N_p} x_i y_j R(a_i, p_j) \;-\; \mu \sum_{i=1}^{N_a} \sum_{j=1}^{N_p} x_i y_j I(a_i, p_j) \;=\; \lambda\, x^{T} R\, y \;-\; \mu\, x^{T} I\, y
$$

$$
\text{s.t.} \quad \sum_{i=1}^{N_a} x_i = N, \qquad \sum_{j=1}^{N_p} y_j = N, \qquad x_i, y_j \in \{0, 1\} \tag{1}
$$

where R ∈ R^{N_a×N_p}, R = [R_ij], R_ij = R(a_i, p_j); I ∈ R^{N_a×N_p}, I = [I_ij], I_ij = I(a_i, p_j); and λ and μ are two weights for the linear combination. Other constraints, such as the uniform distribution of ad insertion points in the source video [79], [80], can be added to (1).

It is observed that there are C_{N_a}^{N} · C_{N_p}^{N} · N! solutions in total to (1). As a result, when the numbers of elements in A and P are large, the search space for the optimization increases dramatically. However, the Genetic Algorithm (GA) [107] can be employed to find solutions approaching the global optimum. Alternatively, the above problem can be solved in practice by a similar heuristic search algorithm [78]–[80]. In this way, the number of possible solutions can be significantly reduced to C_{N_a}^{1} · C_{N_p}^{1} · N!. Note that (1) is a general formulation for advertising. In fact, it can easily be extended to various advertising strategies from different perspectives. The authors in [79] have given detailed discussions on supporting diverse advertising scenarios based on this framework.
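Exhaustive search over (1) is rarely practical, so a greedy variant of the heuristic search mentioned above can be sketched as follows (ours; it repeatedly picks the best currently unused (ad, point) pair under the score λR − μI, which approximates rather than guarantees the global optimum of (1)).

```python
import numpy as np


def greedy_associate(R: np.ndarray, I: np.ndarray, n_insert: int,
                     lam: float = 1.0, mu: float = 1.0):
    """Greedily pick n_insert (ad, point) pairs maximizing lam*R - mu*I.

    R, I: (num_ads x num_points) contextual relevance and intrusiveness matrices.
    Each ad and each insertion point is used at most once.
    Returns a list of (ad_index, point_index) pairs.
    """
    score = lam * R - mu * I
    used_ads, used_points, pairs = set(), set(), []
    for flat in np.argsort(score, axis=None)[::-1]:      # best-scoring pairs first
        i, j = np.unravel_index(flat, score.shape)
        if i in used_ads or j in used_points:
            continue
        pairs.append((int(i), int(j)))
        used_ads.add(i)
        used_points.add(j)
        if len(pairs) == n_insert:
            break
    return pairs
```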

VI. EXEMPLARY SYSTEM: MEDIASENSE

We now show the implementation of a recently developed exemplary application of contextual multimedia advertising, called MediaSense. MediaSense includes ImageSense [78], VideoSense [79], [80], and GameSense [59], [60], which are dedicated to online images, video, and games, respectively. Specifically, we introduce how the relevant ads are selected, how the ad insertion positions are automatically detected, and how the ads and these insertion points are associated in these applications.

A. ImageSense
A snapshot of ImageSense is shown in Fig. 2(a). ImageSense is able to automatically decompose a Web page into several coherent blocks, select the suitable images from these blocks for advertising, detect the nonintrusive ad insertion positions within the images, and associate the relevant advertisements (i.e., product logos) with these positions [78]. The ads are selected based on not only textual relevance but also visual relevance, so that the ads yield contextual relevance to both the text in the Web page and the image content. The ad insertion positions are detected based on image saliency, as well as face and text detection, to minimize intrusiveness to the user. An example of an image tagged with "domestic" and "kitten," which is associated with the product logo of "Animal Planet," is shown in Fig. 9. We can see that the ads ranked only by textual relevance are not closely related to this image. However, through contextual relevance we can know this image is about "cat" and "animal," as we can build specific visual models for recognizing "cat" and "animal." Therefore, ads related to animals can be ranked higher. Furthermore, a suitable position is selected by image saliency detection, so that a visually similar ad is recommended to be embedded at the top-right corner of this image.

Fig. 9. A sample image with an ad embedded by ImageSense [78]. (a) The source image and the image with the ad embedded; (b) the ad lists ranked by textual relevance only and by contextual relevance.
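The corner selection illustrated in Fig. 9 can be sketched as follows. This is a simplified illustration rather than the ImageSense algorithm itself: it assumes a precomputed saliency map and bounding boxes from face and text detection, and simply picks the candidate corner whose region is least salient and does not cover any detected box.

```python
import numpy as np

def pick_insertion_corner(saliency, boxes, ad_h, ad_w):
    """Choose a non-intrusive corner of an image for ad insertion.

    saliency : 2-D array in [0, 1], higher = more salient.
    boxes    : list of (top, left, bottom, right) face/text boxes to avoid.
    ad_h, ad_w : size of the ad region in pixels.
    Returns the name of the best corner, or None if all corners are blocked.
    """
    H, W = saliency.shape
    corners = {
        'top-left':     (0, 0),
        'top-right':    (0, W - ad_w),
        'bottom-left':  (H - ad_h, 0),
        'bottom-right': (H - ad_h, W - ad_w),
    }

    def overlaps(t, l):
        # True if the ad rectangle intersects any face/text box.
        return any(not (t + ad_h <= bt or t >= bb or l + ad_w <= bl or l >= br)
                   for (bt, bl, bb, br) in boxes)

    best, best_score = None, float('inf')
    for name, (t, l) in corners.items():
        if overlaps(t, l):
            continue                       # skip corners covering faces or text
        score = saliency[t:t + ad_h, l:l + ad_w].mean()
        if score < best_score:             # least salient region wins
            best, best_score = name, score
    return best

# Toy example: a 100x100 saliency map peaking at the image center,
# with one detected box blocking the top-left corner.
y, x = np.mgrid[0:100, 0:100]
sal = np.exp(-(((y - 50) ** 2 + (x - 50) ** 2) / (2 * 20.0 ** 2)))
print(pick_insertion_corner(sal, boxes=[(0, 0, 30, 30)], ad_h=25, ad_w=25))
# picks a low-saliency, unobstructed corner, e.g., 'top-right'
```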



In addition to in-video advertising, overlay video advertising can be supported based on the same framework as VideoSense [31], [76]. Fig. 10 shows an example of a video that subscribes to the intelligent overlay video advertising service. Unlike most current ad networks, such as Youtube [118] and Revver [88], which overlay the ads at fixed positions in the videos (e.g., on the bottom fifth of videos, 15 seconds in), the ads in [31], [76] are automatically overlaid on the frames based on highlight detection and visual saliency analysis. An accompanying ad with more information about the product or service is displayed on the right.

Fig. 10. An example of a video subscribing to the overlay ads service [31], [76]. The highlighted yellow rectangle indicates that an ad is overlaid on this video shot, while the yellow spots on the timeline indicate the overlay ads in the video stream. The relevant ads are embedded at the non-salient positions (bottom or top) of the suitable frames, while the corresponding accompanying ads appear beside the video player (on the right).

C. GameSense

A snapshot of GameSense is shown in Fig. 2(c). GameSense is a game-like advertising system built upon the ImageSense framework [59], [60]. Given a Web page, which typically contains images, GameSense is able to select a suitable image to create an online in-image game and associate relevant ads within the game. The contextually relevant ads (i.e., product logos) are embedded at appropriate positions within the online games. The image blocks at the ad positions are used as missing blocks in the "sliding puzzle" game. The ads are selected based on not only textual relevance but also visual similarity. During the game, the ads can alternate once the user has proceeded one step. The game is able to provide viewers a rich experience and thus promote the embedded ads for more effective advertising. Furthermore, users participating in the games can get incentives from the multiple-player mode. On the other hand, advertisers can achieve more ad impressions during the game. The idea is similar to the ESP game, which tackles the problem of creating difficult metadata via human power and computer games [3], [30]. GameSense represents one of the first attempts towards impressionative advertising.
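To make the sliding-puzzle idea concrete, here is a minimal sketch, an illustration only and not the GameSense implementation, that splits an image into a grid, reserves the block at the chosen ad position as the "missing" block, and shuffles the remaining tiles. The 3x3 grid size and all names are assumptions.

```python
import random

def build_puzzle(image_blocks, ad_block, grid=3, seed=0):
    """Create a sliding-puzzle layout for in-image game advertising.

    image_blocks : dict mapping (row, col) -> image tile for a grid x grid split.
    ad_block     : (row, col) of the block reserved for the ad; it becomes
                   the 'missing' block that players slide tiles into.
    Returns (layout, missing) where layout maps board cells to tiles.
    """
    cells = [(r, c) for r in range(grid) for c in range(grid)]
    tiles = [p for p in cells if p != ad_block]          # the ad position stays empty
    shuffled = tiles[:]
    random.Random(seed).shuffle(shuffled)                # scramble the remaining tiles
    layout = {cell: image_blocks[tile]
              for cell, tile in zip(tiles, shuffled)}    # board cell -> tile content
    return layout, ad_block

# Toy example: label each tile by its original position.
blocks = {(r, c): f"tile_{r}{c}" for r in range(3) for c in range(3)}
layout, missing = build_puzzle(blocks, ad_block=(0, 2))  # ad sits at the top-right block
print(missing, layout[(0, 0)])
```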
VII. PERFORMANCE EVALUATION

In order to obtain a fair empirical evaluation of multimedia advertising systems, it is important to use standard and representative data sets and performance metrics. However, there are no benchmark data sets in the research community, and most experimental results have been reported on different data sets. Therefore, we only review the performance metrics commonly used in typical advertising systems. These metrics are categorized as follows, in terms of the primary domains in which they are adopted.

• Online advertising. The performance of online advertising is commonly measured by the ad click-through rate (CTR) or the revenue from advertisers [74]. The CTR is obtained by dividing the "number of users who clicked on an ad" on a Web page by the "number of times the ad was delivered" (impressions) [108]. The commonly used pricing models, such as pay-per-click (PPC), cost-per-thousand (CPT), and cost-per-mille (CPM), are based on the estimation of CTR. However, CTR is usually difficult to obtain without a long-term investigation in a real advertising system.

• Information retrieval. The key problem for advertising is the relevance between advertisements and landing pages or programs (i.e., images or videos). The relevance is usually judged by humans on several scales (e.g., "relevant," "somewhat relevant," and "irrelevant"). Based on these judgments, the relevance can be measured by the precision-recall (P-R) curve, average precision (AP), normalized discounted cumulative gain (NDCG), and so on [47], [102]. For example, recall is defined as the fraction of relevant advertisements (in the whole data set) which has been retrieved for a given document or program, while precision is the fraction of retrieved advertisements (in the returned subset) which is relevant [7]. A P-R curve can then be generated by plotting precision versus recall. The AP corresponds to the area under a non-interpolated P-R curve, which is widely adopted in TRECVID [98]. NDCG is a commonly adopted metric for evaluating a search engine's performance. Given a query q, the NDCG score at depth d in the ranked documents is defined by

$$\mathrm{NDCG}@d = Z_d \sum_{j=1}^{d} \frac{2^{r^{j}} - 1}{\log(1 + j)} \tag{2}$$

where $r^{j}$ is the rating of the j-th document and $Z_d$ is a normalization constant chosen so that a perfect ranking's NDCG@d value is 1 (a computational sketch of these metrics follows this list).



• Multimedia advertising. In addition to relevance, user experience is an important problem in this domain. The evaluation of user experience often depends on a series of subjective tests or user studies [91]. However, there are no commonly used metrics for these subjective evaluations; different researchers have used different kinds of evaluations and metrics. For example, the authors in [76], [78], [80] let subjects judge "satisfaction on ad location" and "overall satisfaction on advertising effect" on a 1 to 5 scale. The authors in [69], [91], [103], [111] designed a survey to collect user feedback (e.g., "positive" or "negative") after the users viewed the advertisements.

• Human-computer interaction. Researchers in this domain have employed eye-movement tracking as a measurement of the effectiveness of advertising [46], [86]. The metrics include fixation (e.g., number of fixations, fixations per area of interest, fixation duration), saccade (e.g., number of saccades, saccade amplitude), scanpath (e.g., scanpath duration, scanpath length), and so on [86].
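The following is a minimal sketch of how the quantitative metrics above can be computed from raw counts and graded relevance judgments. It is illustrative only; for instance, the logarithm base in NDCG and the exact AP variant differ across papers and evaluations, and natural log is assumed here.

```python
import math

def ctr(clicks, impressions):
    """Click-through rate: clicks divided by ad impressions."""
    return clicks / impressions if impressions else 0.0

def precision_recall(retrieved, relevant):
    """Set-based precision and recall for a returned ad list."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

def ndcg_at_d(ratings, d):
    """NDCG@d in the spirit of (2): ratings are graded judgments of the ranked ads.

    The normalization divides by the DCG of an ideally ordered list,
    so a perfect ranking scores 1.
    """
    def dcg(rs):
        return sum((2 ** r - 1) / math.log(1 + j) for j, r in enumerate(rs[:d], start=1))
    ideal = dcg(sorted(ratings, reverse=True))
    return dcg(ratings) / ideal if ideal > 0 else 0.0

# Examples with toy numbers.
print(ctr(clicks=12, impressions=1000))                        # 0.012
print(precision_recall(retrieved=[1, 2, 3], relevant=[2, 3, 5]))  # (0.667, 0.667)
print(ndcg_at_d(ratings=[2, 3, 0, 1], d=4))
```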
VIII. CONCLUSIONS AND FUTURE CHALLENGES

Image and video have become the most pervasive formats and compelling contents on the Internet. With the right strategy and the right technology for advertising, we can start leveraging the power of visual contents to build value in any Web site, page, image, video, and even game. In this paper, we have outlined the trend of multimedia advertising as two generations and discussed the key issues in contextual multimedia advertising, including contextual relevance, contextual intrusiveness, and insertion optimization. We conclude that effective advertising for Internet multimedia should maximize content relevance while minimizing contextual intrusiveness at the same time. We have observed that the key techniques for advertising come from the text domain; that is, textual relevance is the basis for selecting advertisements relevant to a given image or video. However, we have also witnessed that the distinctive characteristics of multimedia have inspired the employment of techniques from the visual domain. The advanced techniques in computer vision (e.g., visual saliency analysis, face detection, text detection, object categorization, and so on) and multimedia retrieval (e.g., video structuring, video concept detection, highlight detection, and so on) have proved effective in improving contextual relevance and reducing contextual intrusiveness. We further described an exemplary system for contextual multimedia advertising, called MediaSense, which consists of ImageSense, VideoSense, and GameSense, dedicated to monetizing the compelling contents of online images, videos, and games, respectively.

Multimedia advertising is a vibrant area of research, and there are many emerging topics deserving deep investigation. We summarize the future challenges as follows.

How can we improve contextual relevance? As we have mentioned in Section III, contextual relevance comes from textual, visual, conceptual, and user relevance. Each type of relevance is a challenging problem in the corresponding research community, ranging from information retrieval and computer vision to human-computer interaction. From the perspective of vision, there are several possible investigations towards better ad relevance.

1) Visual categorization in the image and video domains can be used to describe visual contents better and more precisely, and in turn improve contextual relevance. For example, the advanced techniques for object recognition [27], [104] and video concept detection [34], [36], [87] can benefit conceptual relevance. While fully automatic categorization or annotation has still achieved only limited success [35], [42], Internet-based annotation, which is characterized by collecting crowdsourcing knowledge as well as combining human and computer for active tagging (i.e., ontology-free annotation), is a promising direction for improving ad relevance. For example, a recent work presents an active tagging approach that combines the power of human and computer for recommending tags to images [115]. The research on social tagging has proceeded along another dimension, which aims to differentiate tags with various degrees of relevance [63], [68]. Tags with different relevance can benefit visual search performance and in turn improve the relevance for advertising.

2) Finding advertising visual keywords in images or videos can make the advertisements more visually relevant. Intuitively, the ROIs in an image or video are naturally the most suitable regions that might attract a lot of eyeballs and are thus better information carriers for advertising. The techniques for detecting ROIs can help to promote ad relevance and impression [18], [70]. Quality assessment can be used for finding advertisable video content with high visual quality [81].

3) To improve advertising relevance, in addition to categorizing the source media, the advertisements are expected to be automatically classified into predefined categories, so that it is easy to deliver specified and relevant ads to specified users. To this end, similar to the LSCOM ontology for the video domain [84], [119], we need to build an ontology for advertisements. Some preliminary works have focused on this issue [19], [25], [67].

4) To improve user relevance for targeted advertising, traditional techniques for relevance feedback



in multimedia information retrieval can be used to deliver user-relevant advertisements on the fly [92], [124]. A recent work has attempted to recognize human behavior and understand a user's mobility from sensor data such as GPS logs [123]. Based on the mobility information, the advertisements can be context-aware in terms of location. Behavioral targeting is one of the important research topics in general advertising: by studying the demographics and user behavior in an advertising ecosystem, the advertisements can be tailored to the targeted users [37], [113].

5) To improve user involvement with the advertisement, emotional or affective models could be used to compute the emotional involvement of a video program [32], [33], [121]. This kind of involvement information has been shown to be a key factor in selecting suitable advertisements within a contextual program [20], [28].

How can we achieve the best trade-off advertising strategy which can satisfy different users at the same time? As we have discussed in Section IV, different combinations of relevance and different computations of contextual intrusiveness fit different roles (i.e., publishers, advertisers, and viewers) in an advertising ecosystem. Although we have adopted the strategy of making the inserted advertisements look similar to the source media, which strategy is the best still remains an open issue. Studies from the perspective of psychology and user behavior analysis are desired. Another possible solution is to make advertising interactive and let viewers be engaged and interact with publishers and advertisers.

How can we design suitable advertising approaches for different media genres? As there are various advertising approaches in the literature, which one is the best for a specific kind of medium?
For example, if the source medium is user-generated content like an amateur video from Youtube [118], the intelligent overlay video advertising proposed in [31], [76] might be more suitable due to the typically short duration of the source video. Otherwise, if the source medium is a premier feature movie or professional video from Hulu [43], AOL [5], or MSN Video [82], the in-video advertising [79], [80], which is more aggressive than overlay advertising, might be more suitable. As another example, if the source medium is a high-resolution image with a huge file size and considerable transmission latency, then the inside-image advertising proposed in [65] might be the best.

How can we make advertising more impressive so that users are more willing to interact with the attractive advertisements? This problem can be partially tackled from the following perspectives.

1) Designing advertising in a game form to make users participate in the game and get some incentives simultaneously. For example, GameSense has proved that game-like advertising is a powerful complement to existing multimedia advertising that can attract viewers' attention [59], [60].

2) Leveraging techniques in computer graphics to render the advertisements in a more visually appealing way. The rendering effect is one of the key problems that influence user experience. It is not true that more aggressive rendering of an advertisement is more effective [57], [74]; sometimes, a moderate rendering strategy can make viewers feel more comfortable about the advertisement. For example, one can show the remaining duration of the currently playing advertisement at a certain position on the video or player. Another example is to enable users to click or minimize/maximize the overlay advertisements according to their interests.

3) Rather than directly pushing the product or service information around the media, we can automatically provide rich and related valuable information along with the advertisement embedded in-media. In this way, the user experience can be enriched by connecting more comprehensive and useful information (e.g., weather, discounts, traffic, education, traditional TV commercials, and so on) with the contents of advertisements while users are consuming the advertisements [26], [105]. This would be a new advertising scenario for enriching the advertising service. Techniques in human-computer interaction, such as user modeling, and in computer vision, such as local-feature-based image matching, may help with this issue.

How can we adapt existing multimedia advertising to the mobile domain? As media capture and GPS functionalities have been widely equipped in mobile phones, PDAs, and other digital devices, mobile multimedia advertising has become a promising direction. The distinctiveness of mobile devices, including limited screen size and bandwidth, media capture (e.g., photo, video, and audio), GPS traces, and so on, brings quite a few challenges and new applications to multimedia advertising. One scenario is querying relevant advertising information about a product via photo capture through a mobile phone. The mobile search techniques in [48], [93], [110] can be used for searching advertisements with different modalities.

Although we have summarized multimedia advertising as two generations, we believe that we are now facing the third generation, i.e., impressionative advertising. Using successful applications in the so-called Web 2.0 era as references, we believe that impressionative advertising is the next big bet. The primary characteristic of impressionative advertising is that more and more user interactions and mash-ups of multimedia applications are getting involved. Advertising will eventually become impressive and "game-ized." Through impressionative advertising, users will find that the ads are more impressive, interactive, and game-like compared with the first and second advertising generations.



REFERENCES [22] H. Dai, L. Zhao, Z. Nie, J.-R. Wen, L. Wang, Workshop Multimedia Signal Process.,2008, and Y. Li, BDetecting online commercial pp. 1–5. [1] AdSense. [Online]. Available: http:// [ intention, in Proc. Int. WWW Conf., [41] X.-S. Hua, P. Yin, and H.-J. Zhang, www.google.com/adsense/ 2006. BEfficient video text recognition using [2] AdWords. [Online]. Available: http:// [23] R. Datta, D. Joshi, J. Li, and J. Z. Wang, multiple frame integration,[ in Proc. IEEE adwords.google.com/ BImage retrieval: Ideas, influences, and Int. Conf. Image Process., 2002, pp. 397–400. B [ [3] L. Ahn and L. Dabbish, Labeling images trends of the new age, ACM Comput. [42] T. S. Huang, C. K. Dagli, S. Rajaram, [ with a computer game, in Proc. SIGCHI Surveys, vol. 40, no. 65, 2008. E. Y. Chang, M. I. Mandel, G. E. Poliner, and Conf. Human Factors Comput. Syst., 2004, [24] DoubleClick. [Online]. Available: http:// D. P. W. Ellis, BActive learning for pp. 319–326. www.doubleclick.com/ interactive multimedia retrieval,[ Proc. IEEE, [4] A. Anagnostopoulos, A. Z. Border, [25] L.-Y. Duan, J. Wang, Y. Zheng, J. S. Jin, vol. 96, no. 4, pp. 648–667, Apr. 2008. E. Gabrilovich, V. Josifovski, and H. Lu, and C. Xu, BSegmentation, [43] Hulu. [Online]. Available: http:// B L. Riedel, Just-in-time contextual categorization, and identification of www.hulu.com/ advertising,[ in Proc. ACM Conf. Inform. commercials from TV streams using [44] IT Facts. [Online]. Available: http:// Knowl. Management, 2007, pp. 331–340. [ multimodal analysis, in Proc. ACM www.itfacts.biz/50-bln-digital-photos-taken- [5] AOL. [Online]. Available: http:// Multimedia, 2006, pp. 201–210. in-2007-60-bln-by-2011/8985 www.aol.com/ [26] L.-Y. Duan, J. Wang, Y. Zheng, H. Lu, and [45] iTVx. [Online]. Available: http:// B [6] N. Archak, A. Ghose, and P. Ipeirotis, J. S. Jin, Digesting commercial clips from www.itvx.com/ BShow me the money: Deriving the pricing TV streams,[ IEEE MultiMedia, vol. 15, no. 1, [46] R. J. K. Jacob and K. S. Karn, BEye tracking in power of product features by mining pp. 28–41, 2008. human-computer interaction and usability consumer reviews,[ in Proc. ACM Conf. [27] L. Fei-Fei, R. Fergus, and P. Perona, research: Ready to deliver the promises,[ in Knowl. Discovery and Data Mining, 2007. B [ One-shot learning of object categories, The Mind’s Eye: Cognitive and Applied Aspects [7] R. Baeza-Yates and B. Ribeiro-Neto, IEEE Trans. Pattern Anal. Mach. Intell., of Eye Movement Research, R. Radach, Modern Information Retrieval. vol. 128, no. 4, pp. 594–611, 2006. J. Hyona, and H. Deubel, Eds. Boston: Reading, MA: Addison Wesley, 1999. [28] T. S. Feltham and S. J. Arnold, BProgram North-Holland/Elsevier, 2003. B V [8] S. Boll, Multitube Where multimedia involvement and ad/program consistency as [47] K. Jarvelin and J. Kekalainen, BCumulated [ [ and web 2.0 could meet, IEEE Multimedia, moderators of program context effects, J. gain-based evaluation of IR techniques,[ vol. 14, no. 1, pp. 9–13, Jan.–Mar. 2007. Consumer Psychology, vol. 3, no. 1, pp. 51–77, ACM Trans. Inform. Syst., vol. 20, no. 4, [9] S. Boyd and L. Vandenberghe, Convex 1994. pp. 442–446, 2002. Optimization. Cambridge, U.K.: Cambridge [29] R. J. Gerrig and P. G. Zimbardo, Psychology [48] W. Jiang, D. Liu, X. Xie, M. R. Scott, J. Tien, University Press, 2004. and Life (16 Edition). Boston, MA: Allyn & and D. Xiang, BAn online advertisement [10] BritePic. [Online]. Available: http://www. Bacon, 2001. 
platform based on image content bidding,[ britepic.com/ [30] Google Image Labeler. [Online]. Available: in IEEE Int. Conf. Multimedia & Expo,2009. [11] A. Broder, M. Fontoura, V. Josifovski, and http://images.google.com/imagelabeler/; [49] A. Joshi and R. Motwani, BKeyword L. Riedel, BA semantic approach to http://images.google.com/imagelabeler/ generation for advertising,[ in contextual advertising,[ in Proc. ACM SIGIR [31] J. Guo, T. Mei, F. Liu, and X.-S. Hua, Proc. Workshops IEEE Int. Conf. Data Mining, Conf. Res. and Dev. Inform. Retrieval, 2007. BAdOn: An intelligent overlay video 2006. [ [12] D. Cai, S. Yu, J.-R. Wen, and W.-Y. Ma, advertising system, in Proc. ACM SIGIR [50] Jupiter Research. [Online]. Available: http:// BExtracting content structure for web pages Conf. Res. Dev. Inform. Retrieval, 2009, www.jupiterresearch.com/ [ pp. 628–629. based on visual representation, in Fifth [51] G. Kastidou and R. Cohen, BAn approach B Asia Pacific Web Conf., 2003. [32] A. Hanjalic, Extracting moods from for delivering personalized ads in [13] D. Cai, S. Yu, J.-R. Wen, and W.-Y. Ma, pictures and sounds: Towards truly interactive TV customized to both users [ ‘‘VIPS: A vision-based page segmentation personalized TV, IEEE Signal Process. and advertisers,[ in Proc. Eur. Conf. algorithm,’’ Microsoft Technical Report Mag., vol. 23, no. 2, pp. 90–100, Interactive Television, 2006. Mar. 2006. (MSR-TR-2003-79), 2003. [52] L. Kennedy, S.-F. Chang, and B [14] D. Chakrabarti, D. Agarwal, and [33] A. Hanjalic and L.-Q. Xu, Affective A. Natsev, BQuery-adaptive fusion for [ V. Josifovski, BContextual advertising by video content representation and modeling, multimodal search,[ Proc. IEEE, vol. 96, combining relevance with click feedback,[ in IEEE Trans. Multimedia, vol. 7, no. 1, no. 4, pp. 567–588, 2008. pp. 143–154, 2005. Proc. Int. WWW Conf., 2008, pp. 417–426. [53] L. Kennedy, M. Naaman, S. Ahern, R. Nair, [15] C.-H. Chang, K.-Y. Hsieh, M.-C. Chung, and [34] A. G. Hauptmann, M. Christel, and and T. Rattenbury, BHow Flickr helps us B J.-L. Wu, BViSA: Virtual spotlighted R. Yan, Video retrieval based make sense of the world: Context and [ advertising,[ in Proc. ACM Multimedia, 2008, on semantic concepts, Proc. IEEE, content in community-contributed media pp. 837–840. pp. 602–622, 2008. collections,[ in Proc. ACM Multimedia, [16] P. Chatterjee, D. L. Hoffman, and [35] A. G. Hauptmann, W. H. Lin, R. Yan, Augsburg, Germany, 2007. B T. P. Novak, BModeling the clickstream: J. Yang, and M. Y. Chen, Extreme video [54] A. Lacerda, M. Cristo, M. A. Goncalves et al., Implications for web-based advertising retrieval: Joint maximization of human and BLearning to advertise,[ in Proc. ACM SIGIR [ efforts,[ Marketing Science, vol. 22, no. 4, computer performance, in Proc. ACM Int. Conf. Res. Dev. Inform. Retrieval,2006. Conf. Multimedia, Santa Barbara, 2006. pp. 520–541, 2003. [55] G. Lekakos, D. Papakiriakopoulos, and [17] X. Chen and H.-J. Zhang, BText area [36] A. G. Hauptmann, R. Yan, W.-H. Lin, K. Chorianopoulos, BAn integrated B detection from video frames,[ in Proc. M. Christel, and H. Wactlar, Can high-level approach to interactive and personalized IEEE Pacific Rim Conf. Multimedia, 2001, concepts fill the semantic gap in video TV advertising,[ in Proc. Workshop on [ pp. 222–228. retrieval? A case study with broadcast news, Personalization in Future TV, 2001. IEEE Trans. Multimedia, vol. 9, no. 5, [18] Y.-Y. Chuang, BA collaborative benchmark [56] D. Li, B. Wang, Z. Li, N. Yu, and M. Li, pp. 958–966, 2007. 
for region of interest detection algorithms,[ BOn detection of advertising images,[ in in Proc. IEEE Conf. Comput. Vision Pattern [37] J. Hu, H.-J. Zeng, H. Li, C. Niu, and Z. Chen, IEEE Int. Conf. Multimedia & Expo, 2007. BDemographic prediction based on user’s Recognit., 2009. [57] H. Li, S. M. Edwards, and J.-H. Lee, browsing behavior,[ in Proc. Int. WWW Conf., [19] C. Colombo, A. D. Bimbo, and P. Pala, BMeasuring the intrusiveness of 2007. BRetrieval of commercials by semantic advertisements: Scale development content: The semiotic perspective,[ [38] X.-S. Hua, L. Lu, and H.-J. Zhang, and validation,[ J. Advertising, B Multimedia Tools and Applications, vol. 13, Optimization-based automated home vol. 31, no. 2, pp. 37–47, 2002. video editing system,[ IEEE Trans. Circuit pp. 93–118, 2001. [58] H. Li, D. Zhang, J. Hu, H.-J. Zeng, and Syst. for Video Technol., vol. 14, no. 5, [20] K. S. Coulter, BThe effects of affective Z. Chen, BFinding keyword from online pp. 572–583, 2004. response to media context on advertising broadcasting content for targeted B evaluations,[ J. Adv., vol. XXVII, no. 4, [39] X.-S. Hua, L. Lu, and H.-J. Zhang, Robust advertising,[ in Int. Workshop on Data Mining [ pp. 41–51, 1998. learning-based TV commercial detection, in and Audience Intelligence for Advertising, Proc. ICME, Jul. 2005. [21] H. Cui, J.-R. Wen, J.-Y. Nie, and W.-Y. Ma, 2007. BProbabilistic query expansion using query [40] X.-S. Hua, T. Mei, and S. Li, [59] L. Li, T. Mei, and X.-S. Hua, BGameSense: B logs,[ in Proc. Int. Conf. WWW, 2002, When multimedia advertising meets Game-like in-image advertising,[ Multimedia [ pp. 325–332. the new internet era, in Proc. IEEE Int. Tools and Applications, 2009.



[60] L. Li, T. Mei, C. Liu, and X.-S. Hua, IEEE Trans. Circuits Syst. Video Technol., Workshop on Personalization in Future TV, BGameSense,[ in Proc. Int. WWW Conf., vol. 19, no. 12, pp. 1866–1879, Dec. 2009. 2004. Madrid, Spain, Apr. 2009, pp. 1097–1098. [80] T. Mei, X.-S. Hua, L. Yang, and S. Li, [98] TRECVID. [Online]. Available: http:// [61] S. Z. Li, L. Zhu, Z. Zhang, A. Blake, BVideoSense: Towards effective online www-nlpir.nist.gov/projects/trecvid/ B [ H.-J. Zhang, and H. Shum, Statistical video advertising, in Proc. ACM Multimedia, [99] S. Velusamy, L. Gopal, S. Bhatnagar, and [ learning of multi-view face detection, in Augsburg, Germany, 2007, pp. 1075–1084. S. Varadarajan, BAn efficient ad Proc. Eur. Conf. Computer Vision, Copenhagen, [81] T. Mei, X.-S. Hua, C.-Z. Zhu, H.-Q. Zhou, recommendation system for TV programs,[ Denmark, May 2002, pp. 67–81. and S. Li, BHome video visual quality Multimedia Syst., vol. 14, pp. 73–87, [62] T. Li, T. Mei, I.-S. Kweon, and X.-S. Hua, assessment with spatiotemporal factors,[ Jul. 2008. B M [ Video : Multi-video synopsis, in 1st Int. IEEE Trans. Circuits Syst. Video Technol., [100] Vibrant Media. [Online]. Available: http:// Workshop on Video Mining, in Conjunction vol. 17, no. 6, pp. 699–706, Jun. 2007. www.vibrantmedia.com/ With the IEEE Int. Conf. Data Mining,2008, [82] MSN Video. [Online]. Available: http:// [101] Videoegg. [Online]. Available: http:// pp. 854–861. video.msn.com/ www.videoegg.com/ [63] X. Li, C. G. M. Snoek, and M. Worring, [83] V. Murdock, M. Ciaramita, and [102] E. M. Voorhees and D. K. Harman, TREC: BLearning tag relevance by neighbor voting B V. Plachouras, A noisy-channel approach Experiment and Evaluation in Information for social image retrieval,[ in Proc. ACM [ to contextual advertising, in 1st Int. Retrieval. Cambridge, MA: MIT Press, Multimedia Inform. Retrieval,2008. Workshop on Data Mining and Audience 2005. [64] Y. Li, K. Wan, X. Yan, and C. Xu, Intelligence for Advertising, 2007, pp. 96–99. [103] K. Wan, X. Yan, X. Yu, and C. Xu, BAdvertisement insertion in baseball [84] M. Naphade, J. R. Smith, J. Tesic, BRobust goal-mouth detection for virtual video based on advertisement effect,[ in S.-F. Chang, W. Hsu, L. Kennedy, content insertion,[ in Proc. ACM Multimedia, Proc. ACM Multimedia, 2005, pp. 343–346. B A. Hauptmann, and J. Curtis, Large-scale Nov. 2003, pp. 468–469. [65] Z. Li, L. Zhang, and W.-Y. Ma, BDelivering concept ontology for multimedia,[ IEEE [104] G. Wang and D. Forsyth, BObject image online advertisements inside images,[ in MultiMedia, vol. 13, no. 3, pp. 86–91, retrieval by exploiting online knowledge Proc. ACM Multimedia, 2008, pp. 1051–1060. Jul.–Sep. 2006. resources,[ in Proc. IEEE Conf. Comput. Vis. [66] W.-S. Liao, K.-T. Chen, and W. H. Hsu, [85] Online Publishers. [Online]. Available: Pattern Recog.,2008. BAdimage: Video advertising by image http://www.online-publishers.org/ [105] J. Wang and L.-Y. Duan, BLinking video ads matching and ad scheduling optimization,[ B [86] A. Poole and L. J. Ball, Eye tracking in with product or service information by in Proc. ACM SIGIR Conf. Res. Dev. Inform. human-computer interaction and usability web search,[ in IEEE Int. Conf. Multimedia Retrieval, 2008, pp. 767–768. research: Current status and future &Expo,2009. [67] R. Lienhart, C. Kuhmunch, and prospects,[ in Encyclopedia of Human [106] M. Weideman and T. Haig-Smith, W. Effelsberg, BOn the detection and Computer Interaction, Idea Group, 2004. 
BAn investigation into search engines as a recognition of television commercials,[ in [87] G.-J. Qi, X.-S. Hua, Y. Rui, J. Tang, T. Mei, form of targeted advert delivery,[ in South Proc. Int. Conf. Multimedia Comput. Syst., B and H.-J. Zhang, Correlative multi-label African Institute of Computer Scientists and Ottawa, Canada, 1997, pp. 509–516. [ video annotation, in ACM Multimedia, Information Technologies on Enablement [68] D. Liu, X.-S. Hua, L. Yang, M. Wang, and Augsburg, Germany, Sep. 2007, pp. 17–26. Through Technology, 2002, pp. 258–258 H.-J. Zhang, BTag ranking,[ in Proc. Int. [88] Revver. [Online]. Available: http:// [107] D. Whitley, BA genetic algorithm tutorial,[ WWW Conf., 2009. one.revver.com/revver Stat. Comput., vol. 4, pp. 65–85, 1994. [69] H. Liu, S. Jiang, Q. Huang, and C. Xu, [89] B. Ribeiro-Neto, M. Cristo, P. B. Golgher, [108] Wikipedia. [Online]. Available: http:// BA generic virtual content insertion B and E. S. Moura, Impedance coupling in en.wikipedia.org/wiki/click-through_rate system based on visual attention content-targeted advertising,[ in Proc. ACM [ [109] L. Wu, X.-S. Hua, N. Yu, W.-Y. Ma, and S. Li, analysis, in Proc. ACM Int. Conf. SIGIR Conf. Res. Devel. Inform. Retrieval, BFlickr distance,[ in Proc. ACM Multimedia, Multimedia, 2008, pp. 379–388. 2005. Vancouver, Canada, 2008, pp. 31–40. [70] T. Liu, J. Sun, N.-N. Zheng, X. Tang, and [90] M. Richardson, E. Dominowska, and B [110] X. Xie, L. Lu, M. Jia, H. Li, S. Frank, and H.-Y. Shum, Learning to detect a salient R. Ragno, BPredicting clicks: Estimating [ W.-Y. Ma, BMobile search with multimodal object, in Proc. IEEE Conf. Comput. Vis. click-through rate for new ads,[ in queries,[ Proc. IEEE, vol. 96, no. 4, Pattern Recog., 2007, pp. 1–8. Proc. Int. WWW Conf., 2007. B pp. 589–601, Apr. 2008. [71] D. G. Lowe, Distinctive image features from [91] C. Rohrer and J. Boyd, BThe rise of intrusive [ [111] C. Xu, K. W. Wan, S. H. Bui, and Q. Tian, scale-invariant keypoints, Int. J. Comput. online advertising and the response of BImplanting virtual advertisement into Vis., vol. 60, no. 2, pp. 91–110, 2004. user experience research at Yahoo![ in Proc. broadcast soccer video,[ in Proc. IEEE Pacific [72] Y.-F. Ma, L. Lu, H.-J. Zhang, and M. Li, SIGCHI Conf. Human Factors Comput. Syst., Rim Conf. Multimedia, 2004, pp. 264–271. BA user attention model for video 2004. [ [112] Yahoo!. [Online]. Available: http:// summarization, in Proc. ACM Multimedia, [92] Y. Rui, T. S. Huang, M. Ortega, and www.yahoo.com/ 2002, pp. 533–542. S. Mehrotra, BRelevance feedback: [73] Y.-F. Ma and H.-J. Zhang, BContrast-based A power tool for interactive content-based [113] J. Yan, N. Liu, G. Wang, W. Zhang, Y. Jiang, B image attention analysis by using fuzzy image retrieval,[ IEEE Trans. Circuits and Z. Chen, How much can behavioral growing,[ in Proc. ACM Multimedia, Video Technol., vol. 8, no. 5, pp. 644–655, targeting help online advertising?’’ in Nov. 2003, pp. 374–381. Sep. 1998. Proc. WWW, 2009, pp. 261–270. [74] S. Mccoy, A. Everard, P. Polak, and [93] M. R. Scott, W. Jiang, X. Xie, J. Tien, [114] B. Yang, T. Mei, X.-S. Hua, L. Yang, B D. F. Galletta, BThe effects of online G. Chen, and D. Xiang, BVisiAds: S.-Q. Yang, and M. Li, Online video advertising,[ Commun. ACM, vol. 50, no. 3, A vision-based advertising platform for recommendation based on multimodal [ pp. 84–88, 2007. camera phones,[ in IEEE Int. Conf. fusion and relevance feedback, in Proc. ACM Int. Conf. Image Video Retrieval, 2007. [75] A. Mehta, A. Saberi, U. 
Vazirani, and Multimedia & Expo, 2009. B V. Vazirani, BAdwords and generalized [94] D. Shen, J.-T. Sun, Q. Yang, and [115] K. Yang, M. Wang, and H.-J. Zhang, Active [ on-line matching,[ J. ACM, vol. 54, no. 5, Z. Chen, BBuilding bridges for web query tagging for image indexing, in Proc. Int. Oct. 2007. classification,[ in Proc. ACM SIGIR Conf. Workshop Internet Multimedia Search Mining, in Conjunction With ICME,2009. [76] T. Mei, J. Guo, X.-S. Hua, and F. Liu, Res. Devel. Inform. Retrieval, 2006. B BAdOn: Toward contextual overlay in-video [95] C. G. M. Snoek, M. Worring, [116] Y. Yang and X. Liu, A re-examination of [ advertising,[ Multimedia Syst., 2010. J. C. van Gemert, J. M. Geusebroek, and text categorization methods, in Proc. ACM B SIGIR Conf. Res. Devel. Inform. Retrieval, [77] T. Mei, X.-S. Hua, W. Lai, L. Yang et al., A. W. M. Smeulders, The challenge 1999. BMSRA-USTC-SJTU at TRECVID problem for automated detection of [ 2007: High-level feature extraction and 101 semantic concepts in multimedia, in [117] W.-T. Yih, J. Goodman, and V. R. Carvalho, B search,[ in TREC Video Retrieval Evaluation Proc. ACM Int. Conf. Multimedia, Finding advertising keywords on web [ Online Proc., 2007. Santa Barbara, 2006. pages, in Proc. Int. WWW Conf., 2006. [78] T. Mei, X.-S. Hua, and S. Li, BContextual [96] S. H. Srinivasan, N. Sawant, and S. Wadhwa, [118] YouTube. [Online]. Available: http:// B [ in-image advertising,[ in Proc. ACM vADeo: Video advertising system, in Proc. www.youtube.com/ Multimedia, Vancouver, Canada, 2008, ACM Multimedia, 2007, pp. 455–456. [119] Z.-J. Zha, T. Mei, Z. Wang, and X.-S. Hua, pp. 439–448. [97] A. Thawani, S. Gopalan, and V. Sridhar, BBuilding a comprehensive ontology to B [ [79] T. Mei, X.-S. Hua, and S. Li, BVideoSense: Context aware personalized ad insertion refine video concept detection, in [ A contextual in-video advertising system,[ in an interactive TV environment, in Proc. Proc. ACM SIGMM Int. Conf. Workshop



Multimedia Inform. Retrieval, Augsburg, arousal and valence features,[ in Proc. ICME, [123] Y. Zheng, Q. Li, Y. Chen, and X. Xie, Germany, Sep. 2007, pp. 227–236. 2008, pp. 1369–1372. BUnderstanding mobility based on GPS [ [120] H.-J. Zhang, A. Kankanhalli, and [122] L. Zhao, W. Qi, Y.-J. Wang, S.-Q. Yang, and data, in Proc. ACM Conf. Ubiquitous Comput., S. W. Smoliar, BAutomatic partitioning H.-J. Zhang, BVideo shot grouping using Seoul, Korea, 2008, pp. 312–321. of full-motion video,[ Multimedia Syst., best first model merging,[ in Proc. Storage [124] X. S. Zhou and T. S. Huang, vol. 1, no. 1, pp. 10–28, Jun. 1993. and Retrieval for Media Database, 2001, BRelevance feedback for image retrieval: [ [121] S. Zhang, Q. Tian, S. Jiang, Q. Huang, and pp. 262–269. A comprehensive review, Multimedia Syst., W. Gao, BAffective mtv analysis based on vol. 8, no. 6, pp. 536–544, Apr. 2003.

ABOUT THE AUTHORS

Tao Mei (Member, IEEE) received the B.E. degree in automation and the Ph.D. degree in pattern recognition and intelligent systems from the University of Science and Technology of China, Hefei, in 2001 and 2006, respectively. He joined Microsoft Research Asia, Beijing, China, as a Researcher Staff Member in 2006. He was a visiting professor of Xidian University, Xi'an, China, during 2009–2012. His current research interests include multimedia content analysis, computer vision, and Internet multimedia applications such as search, advertising, management, social networks, and mobile applications. He is the author of one book, five book chapters, and over 80 journal and conference papers in these areas, and holds more than 20 filed patents or pending applications.

Dr. Mei serves as an Editorial Board Member of the Journal of Multimedia, and as a Guest Editor for IEEE Multimedia Magazine, ACM/Springer Multimedia Systems, and the Journal of Visual Communication and Image Representation. He was the principal designer of the automatic video search system that achieved the best performance in the worldwide TRECVID evaluation in 2007. He received the Best Paper and Best Demonstration Awards at the ACM International Conference on Multimedia 2007, the Best Poster Paper Award at the IEEE International Workshop on Multimedia Signal Processing 2008, and the Best Paper Award at the ACM International Conference on Multimedia 2009. Dr. Mei is a member of the Association for Computing Machinery.

Xian-Sheng Hua (Member, IEEE) received the B.S. and Ph.D. degrees from Peking University, Beijing, China, in 1996 and 2001, respectively, both in applied mathematics. When he was at Peking University, his major research interests were in the areas of image processing and multimedia watermarking. Since 2001, he has been with Microsoft Research Asia, Beijing, where he is currently a Lead Researcher with the media computing group. His current research interests are in the areas of video content analysis, multimedia search, management, authoring, sharing, mining, advertising, and mobile multimedia computing. He has authored or co-authored more than 160 publications in these areas and has more than 40 filed patents or pending applications. He is now an adjunct professor of the University of Science and Technology of China, and serves as an Associate Editor of the IEEE TRANSACTIONS ON MULTIMEDIA, Associate Editor of ACM Transactions on Intelligent Systems and Technology, Editorial Board Member of Advances in Multimedia and Multimedia Tools and Applications, and editor of Scholarpedia (Multimedia Category).

Dr. Hua won the Best Paper Award and Best Demonstration Award at ACM Multimedia 2007, the Best Poster Paper Award at the 2008 IEEE International Workshop on Multimedia Signal Processing, and the Best Student Paper Award at the ACM Conference on Information and Knowledge Management 2009. He also won the 2008 MIT Technology Review TR35 Young Innovator Award, and was named one of the "Business Elites of People under 40 to Watch" by Global Entrepreneur. Dr. Hua is a Senior Member of the Association for Computing Machinery.
