Contextual In-Image Advertising

Tao Mei, Xian-Sheng Hua, Shipeng Li
Microsoft Research Asia
5F Sigma Center, 49 Zhichun Road, Beijing 100190, P. R. China
{tmei, xshua, spli}@microsoft.com

ABSTRACT

The community-contributed media contents over the Internet have become one of the primary sources for online advertising. However, conventional ad-networks such as Google AdSense treat image and video advertising as general text advertising, without considering the inherent characteristics of visual content. In this work, we propose an innovative contextual advertising system driven by images, which automatically associates relevant ads with an image (rather than with the entire text in a Web page) and seamlessly inserts the ads in the nonintrusive areas within each individual image. The proposed system, called ImageSense, represents the first attempt towards contextual in-image advertising. The relevant ads are selected based not only on textual relevance but also on visual similarity, so that the ads yield contextual relevance to both the text in the Web page and the image content. The ad insertion positions are detected based on image saliency to minimize intrusiveness to the user. We evaluate ImageSense on three photo-sharing sites with around one million images and on 100 Web pages collected from several major sites, and demonstrate its effectiveness.

Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval—Selection process; H.3.5 [Information Storage and Retrieval]: Online Information Services—Web-based services

General Terms
Algorithms, Experimentation, Human Factors.

Keywords
Web page segmentation, image saliency, image advertising.

1. INTRODUCTION

The proliferation of digital capture devices and the explosive growth of online social media (especially along with the so-called Web 2.0 wave) have led to countless private image collections on local computing devices such as personal computers, cell phones, and personal digital assistants (PDAs), as well as huge and still increasing public image collections on the Internet. Compared with text and video, image has some unique advantages: it is more attractive than plain text and has been found to be more salient than text [9], so it can grab users' attention instantly; it carries information that can be comprehended more quickly—as the old saying goes, a picture is worth a thousand words; and it can be delivered to users faster than video. As a result, image has become one of the most pervasive media formats on the Internet [13]. On the other hand, we have witnessed a fast and consistently growing online advertising market in recent years. Motivated by the huge business opportunities in this market, people are now actively investigating new Internet advertising models. To take advantage of the image form of information representation, image-driven contextual advertising, which associates advertisements with an image or a Web page containing images, has become an emerging online monetization strategy.

Many existing ad-networks, such as Google AdSense [1], Yahoo! [32], and BritePic [4], have provided contextual advertising services around images. However, conventional image advertising primarily uses text content rather than image content to match relevant ads. In other words, image advertising is usually treated as general text advertising, without considering the potential advantages that could be brought by images. There is no existing system that automatically monetizes the opportunities brought by each individual image. As a result, the ads are only generally relevant to the entire Web page rather than specific to the images it contains. Moreover, the ads are embedded at a predefined position in the Web page adjacent to the image, which often destroys the visually appealing appearance and structure of the original page, and fails to grab and monetize the attention aroused by compelling images. Figure 1 (a) and (b) show some exemplary image ad-networks: Google AdSense [1] displays the ads side by side with the image, while BritePic [4] embeds the ads at a corner of the image. It is worth noting that although BritePic [4] also supports embedding ads within an image, the ad positions are most likely predefined, without considering whether these positions would be intrusive to the users.

We believe that it is not appropriate to treat image advertising as general text advertising. One point we want to make here is that online image content can be viewed as an effective information carrier for commercials.

Figure 1: Exemplary image advertising schemes: (a) AdSense [1], (b) BritePic [4], (c) ImageSense. The highlighted rectangular areas indicate the associated ads. Existing image ad-networks like (a) and (b) match ads with Web pages only based on textual relevance and embed the ads at a reserved position, while (c) associates ads not only with the Web page but also with the image content, and seamlessly embeds the ads at a non-intrusive position within the image.

The more compelling the image content, the more audience will view it, and the more revenue will be generated. The following distinctions between image and text advertising motivate a new advertising model dedicated to images.

• We believe that beyond the traditional media of Web pages and videos, images can be more powerful and effective carriers of online advertising.

• We believe that the ads should be dynamically embedded at appropriate positions within each individual image (i.e., in-image) rather than at a predefined position in the Web page. Based on this claim, it is reasonable to assume that the ads should mostly be image logos rather than just textual descriptions, as logos are better suited to being seamlessly embedded into images.

• We believe that the ads should be locally relevant to the image content and its surrounding text, rather than only globally relevant to the entire Web page. Therefore, it is reasonable to assume that the image content and its surrounding text should contribute much more to relevance matching than the whole Web page.

Motivated by the above observations, we propose in this paper an innovative in-image advertising model that automatically delivers contextually relevant ads through an online advertising system. The proposed system, named ImageSense, supports contextual image advertising by associating the most relevant ads with an online image and seamlessly embedding the ads at the most appropriate positions within this image. The ads are selected to be globally relevant to the Web page content, as well as locally relevant to the image content and its surrounding text. Meanwhile, the ads are embedded at the most non-salient positions within the image. By leveraging visual content analysis, we are better positioned to address two key challenges in online advertising, namely ad relevance and ad position. Therefore, ImageSense achieves effective advertising that is a powerful complement to conventional contextual advertising. To the best of our knowledge, this is one of the first attempts to investigate how visual content (both high-level concepts and low-level appearance) can benefit online image advertising. Figure 1 (c) gives an example of ImageSense: a relevant ad (i.e., the NBA logo) has been inserted into a region with little information within the image. Note that the ad logo can also progressively appear and then disappear after a few seconds without affecting the viewing experience, which is less intrusive and more visually pleasant than conventional advertising.

The rest of the paper is organized as follows. Section 2 reviews closely related research on advertising. Section 3 provides a system overview of ImageSense. The details of ImageSense are described in Section 4. Section 5 gives the experiments, followed by conclusions in Section 6.

2. RELATED WORK

Driven by the huge online business opportunities, online advertising has become an emerging area of research. One of the fundamental problems in online advertising is ad relevance: studies have shown that irrelevant ads detract from user experience, while relevant ads increase the probability of user reaction [19]. Another problem is how to pick suitable keywords or Web pages for advertising. This literature review therefore focuses on two key problems: 1) ad keyword selection, and 2) ad relevance matching.

Typical advertising systems analyze a Web page or query to find prominent keywords or categories, and then match these keywords or categories against the words for which advertisers bid. If there is a match, the corresponding ads are displayed to the user through the Web page. Yih et al. studied a learning-based approach to automatically extracting appropriate keywords from Web pages for advertisement targeting [35]. Instead of dealing with general Web pages, Li et al. proposed a sequential pattern mining-based method to discover keywords from a specific broadcasting content domain [16]. In addition to Web pages, queries also play an important role in paid search advertising. In [29], queries are classified into an intermediate taxonomy so that the selected ads are more targeted to the query.

Research on ad relevance has proceeded along three directions, from the perspective of what the ads are matched against:


(1) keyword-targeted advertising (also called "paid search advertising" or "sponsored search"), in which the ads are matched against the originating query [12] [20]; (2) content-targeted advertising (also called "contextual advertising"), in which the ads are associated with the Web page content rather than the keywords [5] [26]; and (3) user-targeted advertising (also called "audience intelligence"), in which the ads are driven by the user's profile and demography [11], or behavior [27]. Although the paid search market has developed more quickly than the contextual advertising market, and most textual ads are still characterized by "bid phrases," there has been a drift towards contextual advertising, as it supports a long-tail business model [15]. For example, a recent work [26] examines a number of strategies to match ads to Web pages based on extracted keywords. To alleviate the problem of exact keyword matching in conventional contextual advertising, Broder et al. propose to integrate semantic phrases into traditional keyword matching [5]. Specifically, both the pages and the ads are classified into a common large taxonomy, which is then used to narrow the search for keywords down to concepts. Most recently, Hu et al. propose to predict user demographics from browsing behavior [11]. The intuition is that while user demographics are not easy to obtain, browsing behaviors indicate a user's interests and profile.

Recently, researchers have also invented context-aware video ads that can be inserted into video clips using intelligent video content analysis techniques [22] [30]. However, due to current network bandwidth limitations, video is still not as popular in Web pages as text and image.

3. SYSTEM OVERVIEW

Figure 2: System framework of ImageSense.

Figure 2 illustrates the system framework of ImageSense, which consists of four components: vision-based page segmentation and image block detection, textual relevance matching, ad insertion point detection, and relevance optimization. Given a Web page which contains some images, a vision-based page segmentation algorithm is applied to partition the page into several blocks with consistent content. The images suitable for advertising are then picked out and represented by their visual content and surrounding text (such as the titles and descriptions) within the corresponding blocks. A textual relevance module ranks the ads according to the global and local textual relevance between the ads and the content of the Web page and images. The textual information of each image comes from the text of the Web page, the surrounding text in the corresponding block, the expansion words of the surrounding text, and the concepts automatically recognized by visual content analysis. Meanwhile, candidate ad insertion positions are detected in each image through visual saliency analysis. The ads are then reranked by simultaneously considering the local content relevance, which is derived from the visual similarity between each ad and the ad insertion point, as well as the intrusiveness of the ad insertion point. Finally, the most contextually relevant ads are embedded at the most non-intrusive positions within each image by a relevance optimization module.

4. IMAGESENSE

In this section, we describe the implementation of ImageSense. We show how the tasks of associating ads with the triggering page and image are formulated as an optimization problem, and how this problem is practically solved using a heuristic search approach.

4.1 Problem Definition

Without loss of generality, we define the task of ImageSense as the association of ads with a Web page; we will show that it is easy to extend it to the tasks of image- and site-based advertising in a single formulation. Let I denote an image in the Web page P, which consists of N_b ad insertion points, represented by B = {b_i}_{i=1}^{N_b}. These insertion points are usually blocks or regions in the image. Let A denote the ad database containing N_a ads, represented by A = {a_j}_{j=1}^{N_a}. The task of ImageSense can be described as: given an image I with a set of ad insertion points B and an ad database A, select N elements from B and A, and associate each selected a_j ∈ A with an appropriate b_i ∈ B. Here N is the number of ads expected to be inserted into each image, which can be given by the publisher. Note that in conventional contextual advertising (e.g., Google AdWords and AdSense), the position of the selected ads is usually fixed to a reserved block within the Web page, so the main focus is only the selection of the ads most relevant to the Web page. In ImageSense, however, a set of appropriate ad positions is automatically detected within each individual image. Therefore, not only the relevance of the ads to the image but also the association of the ads with the detected insertion points must be decided.

As mentioned in Section 3, the ads should be relevant not only to the Web page content but also to the local surrounding text, as well as to the neighboring image blocks around the insertion points. We therefore introduce the following three relevance terms for each ad.

• Global textual relevance R_g(P, a_j)—the textual relevance between the Web page P and the ad a_j, given by the cosine similarity in the vector space model, as in most conventional advertising systems [1] [14] [26]. Since P is always given, we can drop P and rewrite this relevance as R_g(a_j).

• Local textual relevance R_l(I, a_j)—the textual relevance between the image I and the ad a_j, where the text associated with I comes not only from the surrounding text in the corresponding page block, but also from the hidden concepts recognized by visual content analysis. Since I is always given, we can also drop I and rewrite this relevance as R_l(a_j).

• Local content relevance R_c(b_i, a_j)—the combination of the content-based visual similarity between the ad insertion position b_i and the ad a_j, and the non-intrusiveness of b_i for inserting ads. To minimize intrusiveness to the user, the ads are to be inserted at the most non-intrusive positions, and should be visually similar to the corresponding neighboring image blocks.

To support effective image advertising, the three kinds of relevance should be simultaneously maximized. We integrate the above properties into an optimization problem. We introduce the design variables x = [x_1, ..., x_{N_b}]^T, x_i ∈ {0, 1}, and y = [y_1, ..., y_{N_a}]^T, y_j ∈ {0, 1}, where x_i and y_j indicate whether b_i and a_j are selected (x_i = 1, y_j = 1) or not (x_i = 0, y_j = 0). The above expectation can then be formulated as the following nonlinear 0-1 integer programming (NIP) problem [3]:

max_{x,y} f(x, y) = w_g \sum_{j=1}^{N_a} y_j R_g(a_j) + w_l \sum_{j=1}^{N_a} y_j R_l(a_j) + w_c \sum_{i=1}^{N_b} \sum_{j=1}^{N_a} x_i y_j R_c(b_i, a_j)
                  = w_g y^T R_g + w_l y^T R_l + w_c x^T R_c y                              (1)

s.t.  \sum_{i=1}^{N_b} x_i = N,   \sum_{j=1}^{N_a} y_j = N,   x_i, y_j ∈ {0, 1}

where R_g = [R_g(a_1), R_g(a_2), ..., R_g(a_{N_a})]^T, R_l = [R_l(a_1), R_l(a_2), ..., R_l(a_{N_a})]^T, and R_c ∈ R^{N_b × N_a}. The parameters (w_g, w_l, w_c) control the emphasis on global and local textual relevance, as well as local content relevance, and satisfy the constraints 0 ≤ w_g, w_l, w_c ≤ 1 and w_g + w_l + w_c = 1. The parameters can be empirically set so that w_c < w_l < w_g, as it is reasonable to assume that textual relevance is more important than image content relevance [10] [21]; alternatively, they can be trained by cross-validation experiments.

Although the above formulation assumes one pair (I, P), it is reasonable to treat each pair independently using the same equation if P contains more than one image, since different images belong to different page blocks. Furthermore, the formulation is easily extended to support both site- and image-based advertising. For example, if we drop the term w_g y^T R_g in equation (1), it supports image-based advertising; to support Web sites, the triggering pages (or ad landing pages) are first crawled, and then each Web page is treated using equation (1).
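To make the matrix form concrete, here is a minimal sketch (ours, not the authors' code) of evaluating f(x, y) for a candidate selection; all names and the toy numbers are illustrative.

```python
import numpy as np

def objective(x, y, R_g, R_l, R_c, w=(0.5, 0.3, 0.2)):
    """Evaluate f(x, y) = w_g y^T R_g + w_l y^T R_l + w_c x^T R_c y.

    x : 0/1 vector over the N_b candidate insertion blocks
    y : 0/1 vector over the N_a candidate ads
    R_g, R_l : per-ad global/local textual relevance (length N_a)
    R_c : N_b x N_a matrix of local content relevance
    w : (w_g, w_l, w_c); the default respects w_c < w_l < w_g
    """
    w_g, w_l, w_c = w
    return w_g * (y @ R_g) + w_l * (y @ R_l) + w_c * (x @ R_c @ y)

# Toy usage: 4 blocks, 3 ads, select N = 1 of each.
R_g = np.array([0.9, 0.2, 0.5])
R_l = np.array([0.7, 0.1, 0.6])
R_c = np.random.rand(4, 3)
x = np.array([1, 0, 0, 0])
y = np.array([1, 0, 0])
print(objective(x, y, R_g, R_l, R_c))
```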

4.2 Preprocess: Vision-based Web Page Segmentation and Image Block Detection

Given a Web page, it is desirable to first segment it into several blocks with coherent topics, detect the blocks containing images suitable for advertising, and extract the semantic structure, such as the images and their surrounding texts, from these blocks. The image content and surrounding text obtained in this step can then be used for computing the local textual and content relevance.

The Vision-based Page Segmentation (VIPS) algorithm [6] is adopted to extract the semantic structure of a Web page based on its visual presentation. VIPS first extracts all the suitable blocks from the Document Object Model (DOM) tree of the HTML, and then finds the separators between these blocks, where the separators denote the horizontal or vertical lines in a page that visually cross no blocks. Based on these separators, a Web page can be represented by a semantic tree in which each leaf node corresponds to a block. In this way, contents with different topics are distinguished as separate blocks. Figure 3 illustrates the vision-based structure of a sample page. It can be observed that this page has two main blocks, and the block "VB-1-2-1-2" is detected as an image block.

The VIPS algorithm can be naturally used for surrounding text extraction. Specifically, after obtaining all the blocks via VIPS, we carefully select the images in the Web page which are suitable for advertising. Intuitively, images with poor visual quality, images belonging to advertisements (usually placed at certain positions of a page), and decorations (usually too small) are first filtered out. Then, the blocks containing the remaining images are selected as the advertising page blocks. The surrounding texts (e.g., title and description) are used to describe each image. A sketch of such a filter is given below.
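The paper does not spell out the exact filtering rules, so the sketch below only illustrates the kind of heuristics described above; every threshold and input is an assumption.

```python
def is_suitable_for_advertising(img_width, img_height,
                                block_is_ad_slot, quality_score):
    """Heuristic filter for advertising images (thresholds are assumptions).

    Drops decorations (too small), known ad slots, and low-quality images,
    mirroring the filtering step described in Section 4.2.
    """
    MIN_SIDE = 120       # decorations are "usually too small"
    MIN_QUALITY = 0.5    # e.g., a no-reference quality score in [0, 1]
    if min(img_width, img_height) < MIN_SIDE:
        return False     # likely a decoration or icon
    if block_is_ad_slot:
        return False     # an existing ad placed at a typical ad position
    return quality_score >= MIN_QUALITY
```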


Figure 3: Vision-based structure of a sample Web page: (a) segmentation result, (b) "VB" structural tree. "VB" indicates a visual block. The surrounding texts are extracted within the image blocks, i.e., "VB-1-2-1-2."

4.3 Global and Local Textual Relevance

After page segmentation, each suitable image I can be represented by I = {I^(V), I^(T)}, where I^(V) and I^(T) denote the visual and textual information associated with I, respectively. In general, I^(T) is composed of an image title and a textual description in the corresponding page block. Since expansion terms of the triggering page can improve textual relevance [26], we expand the title and the description by classifying the textual content into a category of a predefined hierarchy. The predefined taxonomy consists of more than 1,000 words [33], trained by text categorization techniques based on Support Vector Machines (SVM) [34]. Furthermore, to deal with the case that there are very few words associated with image I, we leverage concept detection to classify the visual content of each image into a set of predefined concepts [21]. (As mentioned, photo-sharing sites like Flickr [8] and photo blogs like MySpace [24] usually have very poor or few tags; it is therefore desirable to obtain this hidden text from the visual content to discover further ad matches.) These concepts are carefully selected from a light ontology used in the TRECVID community [31] in terms of their frequent presence in the real world; they include Airplane, Animal, Boat_Ship, Building, Bus, Car, Charts, Computer_TV-screen, Court, Crowd, Desert, Face, Maps, Meeting, Mountain, Office, Outdoor, Person, Road, Sky, Snow, Sports, Studio, Truck, Urban, Vegetation, Waterscape_Waterfront, and Weather. We achieved 11.92% mean average precision (MAP) on these concepts in TRECVID 2007 [21].

As a result, the textual content I^(T) can be decomposed into three parts, i.e., I^(T) = I^(T1 ∪ T2 ∪ T3) = {I^(T1), I^(T2), I^(T3)}, where I^(T1) = {title, description}, I^(T2) = {expansion words of I^(T1)}, and I^(T3) = {concepts detected based on I^(V)}. Figure 4 shows an example of the original surrounding text, expansion text, and concept text; the corresponding image is extracted from "VB-1-2-1-2" in Figure 3.

Correspondingly, each ad a_j can be represented by a_j = {a_j^(V), a_j^(T)}, where a_j^(V) and a_j^(T) denote the visual and textual information associated with a_j, respectively. Note that a_j^(V) corresponds to the visual content of the ad logo, and a_j^(T) is the original text provided by the advertiser, without expansion words or concepts. The textual information a_j^(T) consists of four structural parts: a title, keywords, a textual description, and a hyperlink.

We use the Okapi BM25 algorithm [28], widely used in information retrieval, to compute the relevance between an ad landing page and an ad [21]. Given a query q with n terms {q_1, q_2, ..., q_n}, the BM25 score of a document d is

sim(q, d) = \sum_{i=1}^{n} idf(q_i) × tf(q_i, d) × (k + 1) / (tf(q_i, d) + k(1 − b + b × ndl(d)))    (2)

where tf(q_i, d) is the term frequency of q_i in d, and ndl(d) = |d| / |d̄| is the normalized document length (|d| and |d̄| denote the document length of d and the average document length in the collection). k and b are two parameters, usually chosen as k = 2.0 and b = 0.75, and idf(q_i) is the inverse document frequency of q_i, given by

idf(q_i) = log( (M − m(q_i) + 0.5) / (m(q_i) + 0.5) )

where M is the total number of documents in the collection and m(q_i) is the number of documents containing q_i.

Let P^(T) denote the textual information of Web page P; the global textual relevance R_g(P, a_j) is then given by

R_g(P, a_j) = sim(P^(T), a_j^(T))    (3)

By considering P^(T) as the query and a_j^(T) as the document, we can rank the ads with regard to the Web page P, as is widely done in most existing contextual advertising systems [1] [5] [14] [23] [26]. We adopt this global textual relevance for ad ranking as the baseline in ImageSense. Accordingly, the local textual relevance R_l(I, a_j) is

R_l(I, a_j) = sim(I^(T), a_j^(T)) = sim(I^(T1 ∪ T2 ∪ T3), a_j^(T))    (4)

The fusion of global and local textual relevance can be regarded as ad reranking; we investigate how local relevance benefits ad selection in the experiments.
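For reference, a minimal sketch of the BM25 scoring of equation (2), with k = 2.0 and b = 0.75 as above; the token-list inputs and in-memory collection are simplifying assumptions.

```python
import math
from collections import Counter

def bm25(query_terms, doc_terms, docs, k=2.0, b=0.75):
    """Okapi BM25 score of a document for a query, as in equation (2).

    query_terms, doc_terms : lists of tokens
    docs : the whole collection (list of token lists), used for idf
           and the average document length
    """
    M = len(docs)
    avg_len = sum(len(d) for d in docs) / M
    ndl = len(doc_terms) / avg_len                 # normalized doc length
    tf = Counter(doc_terms)
    score = 0.0
    for q in query_terms:
        m = sum(1 for d in docs if q in d)         # docs containing q
        idf = math.log((M - m + 0.5) / (m + 0.5))  # inverse doc frequency
        score += idf * tf[q] * (k + 1) / (tf[q] + k * (1 - b + b * ndl))
    return score
```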

4.4 Local Content Relevance

ImageSense embeds the contextually relevant ads at certain spatial positions within an image. Through the analysis of local content relevance, we aim to find the non-intrusive ad positions within an image, and to select ads whose logos are visually similar, or have a similar visual style, to the image, in order to minimize intrusiveness and improve user experience. Specifically, the candidate ad insertion positions (the blocks in B) are detected based on a salience measure derived from an image saliency map, while visual similarity is measured based on HSV color features. These two measures are combined to compute the local content relevance R_c(b_i, a_j).

Given an image I represented by a set of blocks B = {b_i}_{i=1}^{N_b}, a saliency map S representing the importance of the visual content is extracted for I by investigating the effects of contrast in human perception [18]. Figure 5(b) shows an example of the saliency map of an image; the brighter a pixel in the saliency map, the more salient or important it is.
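The saliency model of [18] is contrast-based; as a stand-in (not the method of [18]), a toy local-contrast saliency map can be sketched as follows.

```python
import numpy as np

def contrast_saliency(gray):
    """Toy contrast-based saliency: each pixel's salience is its absolute
    difference from the mean of its 8 neighbors (a stand-in for the
    fuzzy-growing attention model of [18]). `gray` is a 2-D float array
    in [0, 1]; the result is normalized back to [0, 1].
    """
    pad = np.pad(gray, 1, mode="edge")
    # Mean of the 8 neighbors via shifted views of the padded image.
    neigh = sum(pad[1 + dy:pad.shape[0] - 1 + dy,
                    1 + dx:pad.shape[1] - 1 + dx]
                for dy in (-1, 0, 1) for dx in (-1, 0, 1)
                if (dy, dx) != (0, 0)) / 8.0
    sal = np.abs(gray - neigh)
    return (sal - sal.min()) / (sal.max() - sal.min() + 1e-8)
```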


Figure 4: An example of the original surrounding text, expansion text, and concept text. The image is extracted from “VB-1-2-1-2” in Figure 3.

Figure 5: An example of the saliency map and weight map of an image. (a) I: original image, (b) S: saliency map with weight grids overlaid, (c) W: weight map. The brighter the pixel in the saliency map, the more important or salient it is. The 5 × 5 weight map is

1.0 0.8 0.6 0.8 1.0
0.8 0   0   0   0.8
0.6 0   0   0   0.6
0.8 0   0   0   0.8
1.0 0.8 0.6 0.8 1.0

Intuitively, the ads should be embedded at the most non-salient regions, so that the informative content of the image is not occluded and users do not find the ad intrusive. The image block set B is obtained by partitioning image I into M × M grids. Each grid corresponds to a block b_i (i = 1, 2, ..., N_b), where N_b = M^2, and also to a candidate ad insertion point; consequently, there are N_b candidate ad insertion points in total. For each block b_i, a saliency energy s_i (0 ≤ s_i ≤ 1) is computed by averaging the normalized energies of the pixels within b_i. As the saliency map does not consider spatial importance for ad insertion, a weight map W = {w_i}_{i=1}^{N_b} is designed to weight the energy s_i, so that the ads are inserted into the corners or sides rather than the center blocks; Figure 5(c) shows an example of the weight map. Therefore, w_i × (1 − s_i) indicates the suitability of block b_i for embedding an ad.

In addition to occupying a non-salient insertion position, the ads are assumed to have a similar appearance to the neighboring blocks around the insertion position. To measure the visual similarity between a_j and b_i, we compute the L1 distance between a_j^(V) and the neighboring block set B_i of b_i. The distance is defined in the HSV color space, which has been widely adopted in visual content retrieval and search systems [21]:

d(B_i, a_j^(V)) = (1 / |B_i|) \sum_{b ∈ B_i} \sum_{k=1}^{K} | f^(b)(k) − f^(a_j)(k) |    (5)

where |B_i| denotes the number of blocks adjacent to b_i, and f^(b)(k) and f^(a_j)(k) denote the k-th color feature of block b and ad a_j, respectively. K is the feature dimension; for example, if we use a 64-dimensional HSV color histogram to describe the visual content, then K = 64. As a result, the local content relevance R_c(b_i, a_j) between a candidate ad insertion point b_i and an ad a_j is given by

R_c(b_i, a_j) = w_i × (1 − s_i) × (1 − d(B_i, a_j^(V)))    (6)
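Putting equations (5) and (6) together, a compact sketch follows. The grid partitioning mirrors the M × M blocks above; normalizing the histograms and halving the L1 distance (so that 1 − d stays in [0, 1]) is our assumption, and the helper names are illustrative.

```python
import numpy as np

def block_suitability(saliency, weight_map, M=5):
    """Per-block insertion suitability w_i * (1 - s_i) on an M x M grid.

    saliency : 2-D array in [0, 1]; weight_map : M x M array as in Fig. 5(c).
    Returns an M x M array; higher means a better (less intrusive) block.
    """
    H, W = saliency.shape
    s = np.array([[saliency[r*H//M:(r+1)*H//M, c*W//M:(c+1)*W//M].mean()
                   for c in range(M)] for r in range(M)])
    return weight_map * (1.0 - s)

def content_relevance(suit_i, neighbor_hists, ad_hist):
    """R_c(b_i, a_j) = w_i(1 - s_i) * (1 - d(B_i, a_j^(V))), eqs. (5)-(6).

    neighbor_hists : list of HSV histograms (each L1-normalized, K bins)
                     for the blocks adjacent to b_i
    ad_hist        : HSV histogram of the ad logo, also L1-normalized
    """
    d = np.mean([np.abs(h - ad_hist).sum() for h in neighbor_hists]) / 2.0
    # /2 keeps the L1 distance of normalized histograms within [0, 1]
    return suit_i * (1.0 - d)
```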

4.5 Problem Solution

There are C(N_a, N) · C(N_b, N) · N! solutions in total to equation (1). When the numbers of elements in A and B are large, the search space for optimization increases dramatically. For instance, if there are 10,000 ads in A, 16 blocks in B (see Figure 5), and three ads embedded in each image, then the number of solutions is nearly 5.6 × 10^14. When a Web page P contains more than one image for advertising, the search space becomes larger still. We could use a Genetic Algorithm (GA) to find solutions approaching the global optimum; practically, we adopt the following heuristic search algorithm, similar to that used in [22]. Algorithm 1 gives the details of the practical solution. As a result, the number of possible solutions is significantly reduced to O(N_b + N_a + N × N_a').

Algorithm 1 The heuristic searching algorithm for Eq. (1)
1: Initialize: set the labels of all the elements in x and y to "0" (i.e., "unselected").
2: Rank all the elements in x according to w_i × (1 − s_i) in descending order, select the top N elements, and set their labels to "1" (i.e., "selected").
3: Rank all the elements in y according to w_g R_g + w_l R_l in descending order, and select the top N_a' (N < N_a' ≪ N_a) elements.
4: For each x_i among the top N elements in x, select the unselected y_j from the top N_a' elements in y with the maximal w_c R_c + w_g R_g + w_l R_l, and set the label of y_j to "1".
5: Output all the pairs with (x_i = 1, y_j = 1).
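A compact sketch of Algorithm 1 follows (our rendering of the greedy steps, with illustrative defaults for N, N_a', and the weights).

```python
import numpy as np

def heuristic_search(suit, R_g, R_l, R_c, N=1, Na_prime=100,
                     w=(0.5, 0.3, 0.2)):
    """Greedy solver for equation (1), following Algorithm 1.

    suit : length-N_b vector of block suitabilities w_i * (1 - s_i)
    R_g, R_l : length-N_a textual relevance vectors
    R_c : N_b x N_a local content relevance matrix
    Returns a list of (block index, ad index) pairs.
    """
    w_g, w_l, w_c = w
    blocks = np.argsort(-suit)[:N]                               # step 2
    shortlist = np.argsort(-(w_g * R_g + w_l * R_l))[:Na_prime]  # step 3
    pairs, used = [], set()
    for i in blocks:                                             # step 4
        scores = (w_c * R_c[i, shortlist]
                  + w_g * R_g[shortlist] + w_l * R_l[shortlist])
        order = np.argsort(-scores)
        j = next(int(shortlist[o]) for o in order
                 if int(shortlist[o]) not in used)
        used.add(j)
        pairs.append((int(i), j))
    return pairs                                                 # step 5
```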
Note that 4: For each xi in the top N elements in x, select the unselected the following results were produced according to different 0 yj from the top Na elements in y with the max{wcRc + settings of the parameters in equation (1). wgRg + w`R`}, and set the label of yj as “1”. 5: Output all the pairs with (xi =1,yj = 1). • I: Global textual relevance (wg = 1, w` = 0, wc = 0). This is similar to conventional contextual advertising in which the ads are selected according to blocks with the weight “0” are neglected. In conventional the generic content of the Web page [26]. This is used evaluations on textual relevance [5] [14] [26], the number of as the baseline for comparison. relevant ads is usually fixed to three (i.e., N = 3), while we • II: Global and local textual relevance without (T ) (T1∪T2) have set N = 5 for evaluating ad relevance and N = 1 concept (wg =0.3, w` =0.5, wc = 0, I = I ). for evaluating user experience in ImageSense, as well as In addition to global textual relevance, local textual 0 Na = 100. For concept text, we only selected the top five relevance based on Web page segmentation and image concept words according to the confidence scores as cur- block detection is also taken into account, while con- rent performance of concept detection is still not satisfying. cept text is neglected. A large-scale and general concept ontology rather than the one used in TRECVID will help improve relevance matching • III: Global and local textual relevance (wg =0.3, (T ) (T1∪T2∪T3) within the same framework. As there are currently very few w` = 0.5, wc = 0, I = I ). Both global ad-networks supporting real collection of ads with product and local textual relevance is linearly combined for ad logos, we have collected around 7, 285 unique ad product lo- ranking, with concept text included in the computa- gos from innovativeclipart.com. We invited 20 subjects to tion of local textual relevance. annotate each product logo (i.e., title, keywords, description, • IV: Global and local textual relevance, as well and hyperlink). As a result, these ads were associated with as local content relevance (wg = 0.33, w = 0.33, 32, 480 unique words. ` wc = 0.33). All the relevance including global and lo- The implementation of ImageSense consists of two part: cal textual relevance, as well as local content relevance, 1) offline process including crawler, page segmentation, im- is linearly combined. The weights are empirically set age block detection, image saliency extraction, and inverted to be equal for comparison, as we aim to investigate index building of ad database; 2) online matching including the different contributions of multimodal relevance for global and local relevance computation, as well as optimiza- image advertising. tion. According to the above parameters, the computational time for online matching is around 0.2 second. Note that it • V: Global and local textual relevance, as well as typically takes less than 0.1 second for ranking all the ad local content relevance, with different weights logos in the database based on the inverted index. to IV (wg = 0.3, w` = 0.5, wc = 0.2). All the rele- We prepared three photo-sharing sites with around one vance including global and local textual relevance, as million high-quality photos in total. Specifically, we col- well as local content relevance, is linearly combined. 
lected 443, 895 images from Corbis.com, 382, 371 images from Note that the weights are set different according to Tango (an internal photo-sharing site within Microsoft Re- different contributions from the multimodal relevance. search Asia), and around 200, 000 images from Flickr.com. This strategy is the optimal setting of ImageSense. We created two based on the images and their sur- rounding texts for Corbis and Flickr. To evaluate ad rele- As a result, there were 27, 500 (1100 × 5 × 5) page-ad vance, the 100 representative top image queries from Live pairs. For each triggering page, we selected no more than 25 image search [17] were put to the three sites as the queries ads. The ads, which consisted of logo, title, description, and and the top 10 images were returned for each query. Table hyperlink, appeared besides the images in the Web pages. 1 lists some representative image queries. As a result, there There were 15 subjects with different background invited are 1, 000 images and the corresponding surround texts. to manually judge the page-ad relevance. These subjects did not know by which among the above five strategies the current results were produced. Each page-ad pair has been Table 1: The representative top image queries. judged by three or more subjects on a 1-3 scale [5] as follows. party, pics, pyramids, dragon, girl, baby, hearts, zebra, spring, wallpaper, car, cat, rose, money, dog, flower, chat, • Irrelevant (1). The ad is definitely unrelated to the butterfly, fish, graffiti, cute kittens, angel, nature, football, image or the corresponding Web page. puppies, wedding cakes, horse • Somewhat relevant (2). The ad is related to the secondary subject of the image and the Web page, or In addition, similar to text advertising [5] [14] [26], we related to the image or the Web page in a general way. quantified the effect of ImageSense supporting Web page- based advertising using a set of 100 Web pages. These pages • Relevant (3). The ad is directly related to the main were selected from the most recent news in several major subject of the image and the Web page content.
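For reference, a small sketch of how the 1-3 judgments can be reduced to the precision-recall figures used in Section 5.2 (binarizing 2-3 as relevant, with 11-point interpolation as in [2]); the helper names are ours.

```python
def precision_recall_points(ranked_judgments, total_relevant):
    """Precision/recall after each rank for one triggering page.

    ranked_judgments : judgments (1-3) of the returned ads, in rank order;
                       scores 2 and 3 count as relevant
    total_relevant   : number of relevant ads in the evaluated set
    """
    points, hits = [], 0
    for rank, score in enumerate(ranked_judgments, start=1):
        hits += score >= 2
        points.append((hits / total_relevant, hits / rank))  # (recall, prec)
    return points

def eleven_point_avg(points):
    """Interpolated precision at recall levels 0.0, 0.1, ..., 1.0 [2]."""
    return [max((p for r, p in points if r >= level), default=0.0)
            for level in [i / 10 for i in range(11)]]
```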

Figure 6: The evaluations on ad relevance (average relevance on images, Web pages, and overall, for strategies I-V).

Figure 7: Precision-recall curves of the five strategies (I-V).

Table 2: Evaluations on ad position and satisfaction.

                    Conventional ads    ImageSense
  Ad position       3.00                3.75
  Overall           3.68                3.85

5.2 Evaluation on Ad Relevance

After page segmentation, there were on average 86.74 words per page in the 1,100 triggering pages. In the following, we assume that the page-ad pairs judged "2" or "3" are positive and those judged "1" are negative. We also examined counting only the pairs judged "3" as positive, but did not find a significant change in the relative performance. For each triggering page, 15.05 ads were returned on average, among which 9.13 ads were judged relevant.

Figure 6 shows the average evaluations on ad relevance using the five advertising strategies described in Section 5.1. It is reasonable to observe that combining the surrounding text achieves better relevance than using only the text of the whole page. We can also see that local content relevance sometimes degraded relevance, which indicates that the visual similarity between ads and ad insertion points cannot by itself improve relevance; we regard local content relevance as a tradeoff between ad relevance and user experience. Overall, ImageSense improves the contextual relevance of image advertising over conventional text advertising.

Figure 7 shows the precision-recall curves of the five strategies. We used 11-point average figures to quantify the precision [2] [26]. As we are not able to evaluate the entire ad collection, the recall values are relative to the set of evaluated ads. It is observed that the local surrounding text obtained by Web page segmentation significantly improved ad relevance over the general text of the Web pages: building on the local textual information of strategy II, strategies III, IV, and V achieved steady improvements over strategy I. We can also see that strategies III and V achieved the best performance in terms of relevance, while strategy IV underperformed V. This supports the assumption of different contributions from different kinds of textual information—the local surrounding texts of the images and their expansion words are more important than the general Web page content and visual content from the perspective of ad relevance. We believe an optimal setting of the parameters (w_g, w_l, w_c) could further improve the advertising performance.

5.3 Evaluation on User Experience

We evaluated user experience in terms of ad insertion positions and overall satisfaction from the perspective of viewers. The 15 subjects involved in our experiments were invited to give satisfaction scores (1-5) for the ad position and the overall experience; a higher score indicates higher satisfaction. We compared ImageSense to the setting of conventional advertising like Google AdSense [1], where a textual ad unit (i.e., title, description, and hyperlink) is displayed adjacent to the image, on its right. For each triggering page, two results—conventional advertising and ImageSense—were provided to the user, and each page was judged by three different subjects. Note that the ads were identical in the two results except for their presentation: the ads in conventional advertising are textual, while the ads in ImageSense are product logos. The satisfaction score for the ad position in conventional advertising was fixed to 3.0 for comparison. Table 2 lists the evaluation results. ImageSense provided better ad positions and overall satisfaction, and thus achieved a better user experience than conventional advertising. Table 2 corresponds to the fact that in 78% and 55% of the judgments, the score of ImageSense was higher than that of conventional advertising in terms of ad position and overall satisfaction, respectively.

5.4 Advertising Examples

Figure 8: An example of the image before (left, a Corbis image with poor tags) and after (right, with an ad embedded) advertising. Tags: domestic, kitten.

Figure 8 shows an example of image advertising. We can see that the image was tagged with only two words ("domestic" and "kitten"); the final advertising result by ImageSense (based on strategy V) is shown on the right. Figure 9 shows an example of ImageSense for a page corresponding to the image search results of "scene": after image block detection, five images are selected for advertising.

Figure 9: An example of advertising for an image search result page. The query is "scene." The first three images in the top row and the first two in the bottom row are selected for advertising.

6. CONCLUSIONS

We have presented an innovative system for contextual in-image advertising. The system, called ImageSense, is able to automatically decompose a Web page into several coherent blocks, select the suitable images from these blocks for advertising, detect the nonintrusive ad insertion positions within the images, and associate the relevant advertisements with these positions. In contrast to conventional ad-networks, which treat image advertising as general text advertising, ImageSense is fully dedicated to images and aims to monetize the visual content rather than only the surrounding text. We formalize the association of ads with images as an optimization problem that maximizes the overall contextual relevance. We have conducted extensive experiments over large-scale real-world image data and achieved promising results in terms of ad relevance and user experience. To the best of our knowledge, this is one of the first works showing how visual content analysis at different levels (i.e., low-level appearance and high-level semantics) can yield more effective image advertising, especially from the perspectives of ad position detection and contextual relevance.

7. REFERENCES

[1] AdSense. http://www.google.com/adsense/.
[2] R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. Addison Wesley, 1999.
[3] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
[4] BritePic. http://www.britepic.com/.
[5] A. Broder, M. Fontoura, V. Josifovski, and L. Riedel. A semantic approach to contextual advertising. In Proceedings of SIGIR, 2007.
[6] D. Cai, S. Yu, J.-R. Wen, and W.-Y. Ma. VIPS: a vision-based page segmentation algorithm. Microsoft Technical Report MSR-TR-2003-79, 2003.
[7] CNN. http://www.cnn.com/.
[8] Flickr. http://www.flickr.com/.
[9] R. J. Gerrig and P. G. Zimbardo. Psychology and Life (16th Edition). Allyn & Bacon, 2001.
[10] X. He, D. Cai, J.-R. Wen, W.-Y. Ma, and H.-J. Zhang. Clustering and searching WWW images using link and page layout analysis. ACM Transactions on Multimedia Computing, Communications, and Applications, 3(2), 2007.
[11] J. Hu, H.-J. Zeng, H. Li, C. Niu, and Z. Chen. Demographic prediction based on user's browsing behavior. In Proceedings of WWW, 2007.
[12] A. Joshi and R. Motwani. Keyword generation for search engine advertising. In Proceedings of the Workshops of IEEE International Conference on Data Mining, 2006.
[13] L. Kennedy, M. Naaman, S. Ahern, R. Nair, and T. Rattenbury. How Flickr helps us make sense of the world: Context and content in community-contributed media collections. In Proceedings of ACM Multimedia, Augsburg, Germany, 2007.
This has proved the assumption on different con- tributions from different kinds of textual information—the 5.4 Advertising Examples local surround texts of the images and their expansion words Figure 8 shows an example of image advertising by the are more important than the general Web page content and five strategies. We can see that the image was tagged only visual content from the perspective of ad relevance. We be- with two words (“domestic” and “kitten”). The final adver- lieve an optimal set of parameters of (wg, w`, wc) can further tising result by ImageSense (based on strategy V) is shown improve the advertising performance. in Figure 8. Figure 9 shows an example of ImageSense for a Figure 9: An example of advertising for image search result page. The query is “scene.” The first three images in the top row and the first two in the bottom row are selected for advertising. page corresponding to the image search results of “scene.” It [3] S. Boyd and L. Vandenberghe. Convex Optimization. is observed that after image block detection, there are five Cambridge University Press, 2004. images selected for advertising. [4] BritePic. http://www.britepic.com/. [5] A. Broder, M. Fontoura, V. Josifovski, and L. Riedel. 6. CONCLUSIONS A semantic approach to contextual advertising. In We have presented an innovative system for contextual in- Proceedings of SIGIR, 2007. image advertising. The system, called ImageSense, is able [6] D. Cai, S. Yu, J.-R. Wen, and W.-Y. Ma. VIPS: a to automatically decompose the Web page into several co- vision-based page segmentation algorithm. In herent blocks, select the suitable images from these blocks Microsoft Technical Report (MSR-TR-2003-79), 2003. for advertising, detect the nonintrusive ad insertion posi- [7] CNN. http://www.cnn.com/. tions within the images, and associate the relevant adver- [8] Flickr. http://www.flickr.com/. tisements with these positions. In contrast to conventional [9] R. J. Gerrig and P. G. Zimbardo. Psychology and Life ad-networks which treat image advertising as general text (16 Edition). Allyn & Bacon, 2001. advertising, ImageSense is fully dedicated to image and aims [10] X. He, D. Cai, J.-R. Wen, W.-Y. Ma, and H.-J. to monetize the visual content rather than surrounding text. Zhang. Clustering and searching www images using We formalize the association of ads with images as an op- link and page layout analysis. ACM Transactions on timization problem which maximizes the overall contextual Multimedia Computing, Communications, and relevance. We have conducted extensive experiments over a Applications, 3(2), 2007. large-scale of real-world image data and achieved promising [11] J. Hu, H.-J. Zeng, H. Li, C. Niu, and Z. Chen. results in terms of ad relevance and user experience. To the Demographic prediction based on user’s browsing best of our knowledge, this is one of the first works showing behavior. In Proceedings of WWW, 2007. how visual content analysis at different levels (i.e., low-level [12] A. Joshi and R. Motwani. Keyword generation for appearance and high-level semantic) can yield more effec- advertising. In Proceedings of the tive image advertising, especially from the perspectives of Workshops of IEEE International Conference on Data ad position detection and contextual relevance. Mining, 2006. [13] L. Kennedy, M. Naaman, S. Ahern, R. Nair, and 7. REFERENCES T. Rattenbury. How Flickr helps us make sense of the [1] AdSense. http://www.google.com/adsense/. world: Context and content in community-contributed [2] R. 
[14] A. Lacerda, M. Cristo, M. A. Goncalves, et al. Learning to advertise. In Proceedings of SIGIR, 2006.
[15] H. Li, S. M. Edwards, and J.-H. Lee. Measuring the intrusiveness of advertisements: scale development and validation. Journal of Advertising, 31(2):37–47, 2002.
[16] H. Li, D. Zhang, J. Hu, H.-J. Zeng, and Z. Chen. Finding keyword from online broadcasting content for targeted advertising. In International Workshop on Data Mining and Audience Intelligence for Advertising, 2007.
[17] Live Image Search. http://www.live.com/?&scope=images.
[18] Y.-F. Ma and H.-J. Zhang. Contrast-based image attention analysis by using fuzzy growing. In Proceedings of ACM Multimedia, pages 374–381, Nov. 2003.
[19] S. McCoy, A. Everard, P. Polak, and D. F. Galletta. The effects of online advertising. Communications of the ACM, 50(3):84–88, 2007.
[20] A. Mehta, A. Saberi, U. Vazirani, and V. Vazirani. AdWords and generalized on-line matching. Journal of the ACM, 54(5), Oct. 2007.
[21] T. Mei, X.-S. Hua, W. Lai, L. Yang, et al. MSRA-USTC-SJTU at TRECVID 2007: High-level feature extraction and search. In TREC Video Retrieval Evaluation Online Proceedings, 2007.
[22] T. Mei, X.-S. Hua, L. Yang, and S. Li. VideoSense: Towards effective online video advertising. In Proceedings of ACM Multimedia, pages 1075–1084, Augsburg, Germany, 2007.
[23] V. Murdock, M. Ciaramita, and V. Plachouras. A noisy channel approach to contextual advertising. In International Workshop on Data Mining and Audience Intelligence for Advertising, 2007.
[24] MySpace. http://www.myspace.com/.
[25] photoSIG. http://www.photosig.com/.
[26] B. Ribeiro-Neto, M. Cristo, P. B. Golgher, and E. S. Moura. Impedance coupling in content-targeted advertising. In Proceedings of SIGIR, 2005.
[27] M. Richardson, E. Dominowska, and R. Ragno. Predicting clicks: Estimating click-through rate for new ads. In Proceedings of WWW, 2007.
[28] S. E. Robertson, S. Walker, and M. Hancock-Beaulieu. Okapi at TREC-7. In the Seventh Text REtrieval Conference, Gaithersburg, USA, Nov. 1998.
[29] D. Shen, J.-T. Sun, Q. Yang, and Z. Chen. Building bridges for web query classification. In Proceedings of SIGIR, 2006.
[30] S. H. Srinivasan, N. Sawant, and S. Wadhwa. vADeo - video advertising system. In Proceedings of ACM Multimedia, 2007.
[31] TRECVID. http://www-nlpir.nist.gov/projects/trecvid/.
[32] Yahoo! Image. http://images.search.yahoo.com/images.
[33] B. Yang, T. Mei, X.-S. Hua, L. Yang, S.-Q. Yang, and M. Li. Online video recommendation based on multimodal fusion and relevance feedback. In Proceedings of CIVR, 2007.
[34] Y. Yang and X. Liu. A re-examination of text categorization methods. In Proceedings of SIGIR, 1999.
[35] W.-T. Yih, J. Goodman, and V. R. Carvalho. Finding advertising keywords on web pages. In Proceedings of WWW, 2006.