Hashtag-Centric Immersive Search on Social Media MM’17, October 2017, Mountain View, CA USA

arXiv:1706.00695v1 [cs.IR] 2 Jun 2017 07AM 123-4567-24-567/08/06...$15.00 ACM. 2017 © ok ON) hsppradesstepolmintegrating problem the addresses paper This (OSNs). works C utb ooe.Asrcigwt rdti emte.T permitted. is credit with Abstracting honored. be must ACM aiu nieSca ewrs(Ss h udmna plat fundamental the (OSNs) Networks Social Online various uiGo ia ag oge e,Caghn u 07 Hashtag-ce 2017. Xu. Changsheng Ren, Tongwei Sang, Jitao Gao, Yuqi format: Reference ACM zdtewypol e cest hi neetdeet,ma events, interested their to access get people way the ized oilmdai eodn n icsigwa apn nrea in Its happens what discussing and recording is media Social INTRODUCTION 1 10.475/123_4 DOI: 20 October USA, pages. CA 9 View, Mountain In Multimedia, Media. on conference Social on Search Immersive multim social application, cross-OSN search, media social KEYWORDS queries. event trending of quantitativ hundreds and on qualitative ments th by validated of is eectiveness solution The posed display. cl for in organized hierarchy are f hashtag-item YouTube, items and related Flickr Twitter, the OSNs, query, three event an Given clu demonstration. representation, and hashtag for proposed brid is three-sta framework the A tion as organization. and OSNs, aggregation dierent information in for anno items to multi-modal used organize widely is and which hashtag, exploit We experience. me social immersive an facilitate to information cross-OSN So Online dierent in distributes information media Social ABSTRACT O:10.475/123_4 DOI: USA CA View, [email protected] Mountain from MM’17, permissions requ Request lists, fee. to a redistribute to and/or or servers on post to work publish, this of components n for this Copyrights bear page. copies rst that the and on thi tion advantage of commercial ar part copies or that or prot provided all for fee of without granted copies is hard use or classroom digital make to Permission o xml,atog lentv erhotosaesuppo sea are options Twitter search Taking alternative although OSN. example, for one from information suppor of only and exploration based single-OSN are functions search media esv vn ecito n nesadn ndrn fo perspectives. dierent dierent from in and understanding and description co event enables hensive information cross-OSN These Flickr. photos and inauguration stagram share Youtube, wat on Twitter, video on debate discuss progress read-time preside follow US 2016 people of election, event the regarding example, For [1]. d OSNs between propagates and distributes information relevant n rpgto iny h nomto nsca ei a media social its on rea in information features to the addition eciency, In propagation sharing. and and acquisition information for nsieo h rs-S itiuincaatrsi,mos characteristic, distribution cross-OSN the of spite In eltm information real-time ut-oredistribution multi-source aha-eti mesv erho oilMedia Social on Search Immersive Hashtag-centric and in propagation ecient uiGo ia ag oge e,Caghn Xu Changsheng Ren, Tongwei Sang, Jitao Gao, Yuqi rceig fAMinternational ACM of Proceedings eadn h aeevent, same the Regarding . rspirseicpermission specic prior ires o aeo distributed or made not e g. oyohrie rre- or otherwise, copy o okfrproa or personal for work s tc n h ulcita- full the and otice we yohr than others by owned a revolution- has 7(MM’17), 17 experi- e tdlike rted edia i search dia ilNet- cial esolu- ge ierent social t stering hand ch world. l l-time pro- e rmats mpre- In- on uster- form ntial king the t ntric rom tate rch the lso ge iesebde nTitrtet r o sgo stoeon those as good as not are tweets Twitter in embedded videos n oaiyfcscmonstedut nageainan aggregation in diculty the d organization. compounds the Moreover, focus times. modality viewed ent and rating relevance, date, by s supports Youtube relevance, and se interestingness supports date, by Flickr popularity, su and Twitter recentness die solution. by consistent support searching an r OSNs avoid search Organization. and the options (2) search make biased. will and query noisy inaccurate sults an b with solely query relevance collected appropriate items the an The create event. to the users describe accurately common Rel for (1) dicult problems. is several It from suers at level processing item the However, individual OSNs. dierent organi from and items collect turned directly to is solution straightforward aei nomto eosrto n oildsuso,r discussion, social and tively. adva demonstration haves information YouTube in and tage Flickr eciency, propagation the and ava ev data adequate to in features Twitter contribution While understanding. complementary die make from which information perspectives, describe OSNs lev endorsement coverage. in Information nor quality in neither YouTube and Flickr o ie.I srpre rmordt nlssta h imag the that analysis data our from reported is image It for video. Flickr fo for text, for usually Twitter e.g., OSN modality, single Popular on richness. bet Information a (1) from prevent perience. issues following the popularity, and time 2016” "Election issuing OSNs by dierent hashtags to collected The 1: Figure r grgtd raie,addmntae ssac resu search as dierent demonstrated from and information organized, related aggregated, the are query, event an Given nimriecosONsac rmwr stu retyne urgently thus is framework search cross-OSN immersive An YouTube , etere- the ze earching sdon ased ilability arching lts. evance. pports sand es e ex- ter espec- cuses l (2) el. OSNs ier- rent rent The ent the to n- e- d eded: MM’17, October 2017, Mountain View, CA USA Yuqi Gao, Jitao Sang, Tongwei Ren, Changsheng Xu

Figure 2: The solution framework.

This paper proposes to exploit the hashtags as bridge to solve are ranked according to the relevance to the query for search result the cross-OSN immersive search problem. Hashtag is a typical so- demonstration. In the following we summarize the main contribu- cial media feature widely used on dierent OSNs. It well addresses tions of this work: the above two issues regarding relevance and organization: (1) As We position the problem of cross-OSN immersive search. In- • a type of user annotation, hashtag guarantees the relevance of formation in multiple modalities and from dierent OSNs is the annotated items to the events. Moreover, more related items integrated and demonstrated around event queries. can be retrieved by querying the hashtag. (2) Hashtag is originally We propose a three-stage framework to exploit the hashtag • adopted for information management, regardless of OSN or modal- as bridge for cross-OSN information integration and demon- ity [2], making it a natural tool for cross-OSN and multi-modal stration. The popularity and relevance warranty of hashtags information organization. Fig. 1 illustrates the collected hashtags enable an ecient and eective solution. from the returned search results of one query from Twitter, Flickr We implemented an online demo for search result demonstra- • and YouTube where subtopics are marked with dierent colors. tion2. Real-world quantitative and qualitative evaluation demon- 1 Two quick observations derive : (1)Regarding the same event query, strates the advantage of the proposed solution. multiple hashtags are adopted on each OSN and vary between OSNs. (2)Dierent hashtags describe the dierent aspects, i.e., subtopics, 2 RELATED WORK of the event. The topic of comparing and fusing search results from multiple To consider the above observations and better exploit hashtag search engines has been addressed in several studies. In [3], the for cross-OSN information aggregation and organization, the fun- authors examined the characteristics of nine search engine logs in damental problem in our proposed solution is to discover the un- US and Europe. In [4], a crowd-ranking method is proposed to fuse derlying subtopics, and organize the multiple hashtags as well as the search results from dierent search engines for visual search the hashtag-annotated items under the discovered subtopics. As re-ranking. Regarding the comparison between traditional Web shown in Fig. 2, the hashtag-centric immersive search framework search and social media search, [5] made a comparison of users consists of three stages. The rst stage learns topical representa- search behaviors in Twitter search and Web search, and [6] exam- tion over a unique vocabulary space for hashtags on each OSN. At ined the dierence between traditional search engines and social the second stage, cross-OSN hashtags are clustered into subtopics media search in view of health related information. However, the considering both the semantic correlation between topics and hash- topic of exploiting the search results from dierent social media tag co-occurrence constrain. Finally, the derived hashtag clusters networks has been largely ignored.

1We will justify the two observations with more evidence in the data analysis section. 2https://hashtagasbridge.github.io/Hashtag/ Hashtag-centric Immersive Search on Social Media MM’17, October 2017, Mountain View, CA USA

1e7 1e5 Cross-OSN analysis and application has recently received at- 1.8 5 Flickr Youtube 1.6 Twitter Twitter tention. One important research line is user-centric, i.e., to inte- 4 1.4 grate the same individual’s cross-OSN information for user model- 1.2 3 ing. The authors in [7] introduced an immersive cross-OSN solu- 1.0 0.8 2 tion to construct unied user proles by associating user informa- 0.6 Avg. Resolution Avg. Avg. Duration(ms) Avg. 0.4 tion on Facebook and Twitter. [8, 9] addressed the topic of cross- 1 OSN recommendation by mining users’ interests between Twitter, 0.2 0.0 0 0 50 100 150 200 250 300 350 0 50 100 150 200 250 300 350 YouTube and Pinterest. The other research line, which is content- Queries Queries centric and more relevant to this work, is to connect the topic/event (a)Resolution of image (b)Duration of video information across dierent OSNs. In [10], the authors proposed SocialTransfer, which is a cross-domain real-time learning frame- Figure 3: Information richness comparison. work to connect between Twitter and YouTube. A crowdsourcing solution is presented in [11] to discover the cross-OSN topic cor- 105 106 Youtube Youtube relations. Inspired by these pilot studies, in this work, we propose 5 104 10 Twitter Twitter 104 to address the topic of cross-OSN search result fusion and demon- 103 Flickr Flickr

103 stration. 102 102 101 Although hashtag was originally initiated by Twitter, it has be- 101 100 100 come a common functionality across dierent OSNs. Many studies Comment #Avg. 10−1 #Avg. Endorsement #Avg. 10−1

−2 have analyzed hashtag usage pattern or employed hashtag for ap- 10 10−2

10−3 10−3 plications. [2] examined the motivation, goal and usage patterns of 0 50 100 150 200 250 300 350 0 50 100 150 200 250 300 350 Queries Queries users adopting hashtags. In [12], hashtag is analyzed and utilized for semantic organization and categorization. A recent work [13] (a)Number of comments (b)Number of endorsements exploited hashtag to discover the ne-grained event-related seman- Figure 4: User interaction comparison. tics within Twitter. These studies demonstrate the eectiveness of hashtag in social media information organization, which lay foun- dations and motivates us to implement a hashtag-centric solution We then compared how dierent the search results of three OSNs in this work. attract user interactions. Two types of interactions are examined, 3 DATA ANALYSIS i.e., comment and endorsement. For comment, the number of retweet is calculated on Twitter. For endorsement, like/dislike are counted To justify the cross-OSN search problem and motivate our solution on YouTube, and favorite is counted on Twitter and Flickr. Fig. 4(a)(b) using hashtag as bridge, this section conducts data analysis to rst illustrate the average number of comments and endorsements on answer two questions: (1) What are the advantages of integrating the three OSNs in log-scale. We see that the two gures reach simi- single-OSN search results? (2) How people use hashtag across dif- lar conclusions that YouTube search results generally attract more ferent OSNs? user interaction than Flickr and Twitter. Combining the above com- parisons on information richness and user interaction likelihood, 3.1 Single-OSN Search Comparison we justify the necessity of integrating single-OSN search results: We rst examined the information richness of search results re- not only a better multi-modal search experience is guaranteed, but turned from dierent OSNs. Specically, the resolution of shared more advanced features like social interaction can be explored and images and duration of shared videos are compared among Twitter, enabled. Flickr and YouTube. 347 queries are selected from Google Trends3, covering topics from politics, society, to economics and entertain- 3.2 Cross-OSN Hashtag Usage Analysis 4 ment. Each query is submitted to the OSN APIs to obtain totally This subsection addresses the availability and challenge of using 32,716 tweets from Twitter, 235,704 image items from Flickr and hashtag to integrate cross-OSN search results. In Fig. 5, we calcu- 195,505 video items from Youtube. Fig. 3(a)(b) show the average late within the returned search results per query, what percentage resolution and duration of returned/embeded images and videos of search results and users are with hashtag on the three OSNs. for each query, respectively. Regarding Twitter, only the tweets It is shown that hashtag is very popular on Twitter, with average with embedded images or videos are counted when calculating av- search result and user percentage above 25%. on Flickr, the per- erage resolution/duration. It is easy to see that the search results centage with hashtag varies between queries and the average per- from Flickr and YouTube capture signicant richer information centage is 13.9% for user and 8.1% for search result. YouTube shows than those from Twitter in terms of image resolution and video slightly lower hashtag popularity with about 4.7% average percent- duration. While Twitter has advantage in text-based information age. Considering the relative importance of the hashtag-annotated propagation, Flickr and YouTube can complement Twitter search search results5, this percentage is adequate for cross-OSN search results by providing more qualied images and videos. result integration. 3https://trends.google.com 4 Flickr: https://www.ickr.com/services/api/ 5It is calculated that the 4.7% YouTube videos with hashtag occupy over 10% total Twitter: https://dev.twitter.com/overview/api endorsement. Moreover, in our solution, the items to be integrated are not limited to Youtube: https://www.youtube.com/yt/dev/api-resources.html the API search results but inclusive of more items by retrieving the hashtags. MM’17, October 2017, Mountain View, CA USA Yuqi Gao, Jitao Sang, Tongwei Ren, Changsheng Xu

0. 5 0. 5 Table 2: NFr score to examine hashtag usage dierence be- Youtube (Avg.=0.047) Youtube (Avg.=0.049) Twitter (Avg.=0.271) Twitter (Avg.=0.255) tween OSNs. 0. 4 0. 4 Flickr (Avg.=0.081) Flickr (Avg.=0.139)

0. 3 0. 3 Twitter&Flickr Twitter&Youtube Flickr&Youtube 0. 2 0. 2

Avg. Proportion Avg. Proportion Avg. 0.1006 0.0857 0.0375 0. 1 0. 1

0. 0 0. 0 0 50 100 150 200 250 300 350 0 50 100 150 200 250 300 350 Queries Queries 4 SOLUTION (a)Search results with hashtag (b)User with hashtag 4.1 Topical Representation Learning Figure 5: Popularity of hashtag. In order to fully represent the hashtags as well as explore more related content, for each returned hashtag, we further collected all the items annotated by the corresponding hashtags (referred as ex- Table 1: #. Unique hashtag used per query. tended search results). The rst stage learns hashtag topical representation from their annotated item set to facilitate the later hash- YouTube Twitter Flickr tag clustering and ranking. Two issues are addressed: (1) Regarding the same query, most search results are related to the query and the 17.77 28.42 27.61 returned hashtags may share a general topic from at topic modeling method like Latent Dirichlet Allocation (LDA). We employ the hierarchical topic modeling method, hLDA [16], to explore ne- grained semantics and avoid the learned topical distributions of Other than popularity across dierent OSNs, hashtag also fea- hashtags mixing with each other. (2) Topic modeling is separately tures in usage diversity. When talking about certain topics/events, conducted over the items on dierent OSNs. With dierent vocab- users are likely to create multiple hashtags and these hashtags ularies, the learned topical distribution of cross-OSN hashtags can- varies between OSNs. We counted the unique hashtags used per not be directly compared. In this case, random walk is employed query on the three OSNs and summarize the number in Table 1. It is over the word semantic graph to bridge the cross-OSN topics with noted that while multiple hashtags pose challenges to search result a unique integral vocabulary. In the following, we elaborate the integration, they also provide possibility to ne-grained analysis solution to the above two issues respectively. by exploring subtopic from hashtags. We further compare the used 4.1.1 Hierarchial Topic Modeling on respective OSN. To facili- multiple hashtags between OSNs. Spearman’s footrule [14, 15] is tate the exploration of semantic structure, we conduct topic mod- widely used to measure the similarity of list pairs. We employed a eling over the extended search result set from each query q. The ex- normalized version calculated as follows: tended search results constitute three document collections T , Y , F Dq Dq Dq for Twitter, YouTubeand Flickr respectively. Hierarchical topicmod- S Fr | | µ1, µ2 NFr µ1, µ2 = 1 ( ) (1) eling is conducted over each OSN collection, with the textual con- ( ) − max Fr S dT,Y, F 6 | | tent of each item q as document over the respective vocabulary space T,Y, F . where µ1, µ2 are two lists, S is the number of overlapping ele- W | | S 2 Dierent from the standard topic model like LDA which has a ments between two lists, max Fr | | equals 1 2 S when S is / | | S | | at topic structure, the hierarchical topic model, i.e., hLDA, orga- even and equals 1 2 S + 1 S 1 when S is odd, Fr µ1, µ2 / (| | )(| |− ) | | | |( ) nizes topics in a tree of xed depth M. Each document is assumed is the standard Spearman’s footrule measure calculated as: to be generated by topics on a single path from the root to a leaf S through the tree. Note that all documents share the root topic in S | | hierarchical topic model, which is consistent with the character- Fr µ1, µ2 = µ1 i µ2 i (2) | |( ) Õ | ( )− ( )| i=1 istics of search result collection. In our case, we select the tree depth M = 2. After topic modeling, taking Twitter as example, th each document dT is attached with a 2-dimension topic distribu- where µ1 i is the rank of i element in list µ1. The NFr score ( ) , T,leaf , ranges from 0 to 1. The higher the NFr score, more similar be- tion p zT root dT ,p z dT . zT root is the root topic and [ ( | ) ( k | )] tween the two lists. To calculate the NFr score, we rank the unique T,leaf th th T zk is the k leaf topic. For the i hashtag hi , its topical dis- hashtags returned from each OSN by the number of annotated tribution over the Twitter leaf topic space is aggregated over all its search results in descending order. Table 2 shows the NFr score annotated items and calculated as: between OSNs averaged over the 347 queries on average. It is ob- T,leaf T T dT vious that hashtag lists between OSNs are considerably dierent, d p zk Í ∈DhT ( | ) making direct integration infeasible based on shared cross-OSN T,leaf T = i p z hi T (3) ( k | ) K T,leaf T hashtags. Combining with the previous data analysis on popular- = dT T p z d Ík 1 Í hT ( k | ) ity, we conclude that hashtag is widely used and available as bridge ∈D i to integrate cross-OSN search results, but the integration and or- 6Textual content on each OSN is extracted as Twitter tweet, YouTube video title & ganization remain challenges due to the hashtag usage diversity. description and Flickr image title & description. Hashtag-centric Immersive Search on Social Media MM’17, October 2017, Mountain View, CA USA

T T where K is the number of leaf topics on Twitter, T denotes and respectively, the element of matrix H takes values fol- Dhi H T lowing ν, i.e., H ν . the collection of items annotated with hashtag hT . Noted that the ij ij i Hˆ ∼ H root topics regarding the same query across OSNs are assumed Let matrix be an approximation to , which depends only on mapping ρ,γ and the resultant summary statistics such as to be similar, and we represent the hashtag by only considering ( ) row and column marginal, co-cluster marginals. Then the optimal the leaf topic distribution. As a result, we obtain three topic spaces , , , , , co-clustering mapping ρ ,γ is identied such that the expected zT leaf , zY leaf , zF leaf over respective vocabulary set T Y F ( ∗ ∗) { } W Bregman divergence with respect to ν between Hˆ and H is mini- and each hashtag’s distribution over the corresponding topic space. mized: 4.1.2 Random Walk-based Cross-OSN Vocabulary Integration. The goal is to analyze the cross-OSN topics over a unique vocabulary ρ∗,γ ∗ = arg min E d H, Hˆ ( ) ρ,γ [ ϕ ( )] set all = T Y F . To bridge the dierent vocabu- Ð Ð (5) W W W W = arg min νijd Hij , Hˆij lary sets, the semantic relation between words are considered. We ρ,γ Õ Õ ϕ ( ) i j make use of WordNet [17] to calculate the similarity πij between all word wi and wj . With word w as nodes and word similar- where ϕ is a real-valued convex function and d z1,z2 is the Breg- ∈W ϕ ( ) ity π as weight, a word semantic graph G is constructed. man divergence dened as: Random walk has been widely used in information retrieval [18– d z1, z2 = ϕ z1 ϕ z2 < z1 z2, ϕ z2 > 20] to explore the semantic correlations. In this work, we conduct ϕ ( ) ( )− ( )− − ▽ ( ) random walk over the constructed word graph G to propagate ϕ is the gradient of ϕ. ▽ the relevance scores among words. Specically, a transition ma- 4.2.2 Co-Clustering with Bilateral Regularization. Bregman co- trix R all all is constituted, where the transition probabil- |W |×|W | clustering can be iteratively solved. During each iteration, three = all ity from word wi to wj is calculated as Rij πij wk πik . /Í ∈W subproblems are addressed. The rst subproblem updates the ap- At iteration l, the relevance score of node i is denoted as sl i , Hˆ s( ) proximation matrix by solving a Minimum Bregman Information and the relevance scores of all word nodes constitute a vector l = problem [21] with current mapping ρi ,γi . A permuted version of ..., s i ,... T . The random walk process is thus formulated as: ( ) [ l ( ) ] Hˆ is then generated by randomly changing rows or columns, which H˜ s + = α s R + 1 α t (4) is denoted as . The second and third sub-problems select the op- l 1 Õ l ( − ) i timal column and row cluster mappings based on the generated permuted matrix H˜ . Specically, the following two functions are where t denotes the initial probabilistic relevance scores as the orig- optimized: inal topic-word distribution, and α is a weight parameter that belongs to (0, 1). γi+1 t = arg min E d H, H˜ (6) ,..., H t ϕ The above process will promote the words with many close ( ) 1 Lcol | [ ( )] neighbors and weaken the isolated words. It is proved to converge = H H˜ ρi+1 h arg min ET h dϕ , (7) 1 1,...,Lr ow to s = 1 α l αR − t which is a xed point [19]. After random ( ) | [ ( )] ( − )( − ) all walk, we obtain a cross-OSN topic space z over the integral vo- where Lcol , Lrow denote the number of column and row clusters, all cabulary . EH t , ET h denote the expectation according to marginal distribu- W tion| of ν |by xing T = t and H = h. 4.2 Hashtag-Topic Co-Clustering In the case of our problem, and represent the cross-OSN H T H hashtag and topic sets. For each query, the matrix Nh Nt is con- This stage exploits the above-obtained topical distribution for cross- × structed by the obtained hashtag-topic distribution p zall hT,Y, F , OSN hashtag clustering. Two issues remain: (1) Although topics ( | ) from dierent OSNs are connected by integrating the vocabulary, where Nh, Nt denote the number of cross-OSN hashtags and top- each hashtag only has distribution over the topics on the corre- ics. To conduct hashtag-topic co-clustering, we introduce a novel sponding OSN, e.g., for Twitter hashtag hT :p zF,leaf hT = 0,p zY,leaf hmodelT = of Co-Clustering with Bilateral Regularization (CCBR). At ( | ) ( | ) 0. (2) Topics have intra-relation both within OSN and cross OSNs, each iteration, the second and third sub-problems in Bregman co- which need to be considered for hashtag clustering. The intra-relation clustering are modied by considering topic semantic relation and among topics are captured by the topic-word distribution over the hashtag co-occurrence information, respectively. unique vocabulary all . To address the two issues, we introduce The second sub-problem addresses column clustering, i.e., topic W a hashtag-topic co-clustering solution, and incorporate topic se- clustering. As in Eqn. (6), the standard column clustering solution mantic relation and hashtag co-occurrence information in topic only exploits the involvement of dierent topics with hashtags H and hashtag clustering respectively. The rest of the subsection rst recorded in . To incorporate the semantic relation between topics, reviews the standard Bregman co-clustering method, and then elab- we assume that the optimal topic cluster mapping γ captures not orates our proposed hashtag-topic co-clustering solution. only the topic-hashtag involvement but also the topic-word distribution. With the cross-OSN topic-word distribution p all zall (W | ) T all 4.2.1 Bregman Co-Clustering. Bregman co-clustering [21] is widely constituting a matrix Nt , we conduct row clustering over used in multi-dimensional data analysis. It aims to nd the optimal T simultaneously with the×|W column| clustering over H. The updated row and column mapping ρ ,γ of an existing matrix H dened optimal function is as follows: ( ∗ ∗) on two sets and . Let ν = νij ;i = 1, , , j = 1, , H T { ··· |H| ··· |T|} γi+1 t = arg min E d H, H˜ + E d T, T˜ (8) be the joint probability measure of variable pair H,T dening on ,..., H t ϕ W t ϕ ( ) ( ) 1 Lcol | [ ( )] | [ ( )] MM’17, October 2017, Mountain View, CA USA Yuqi Gao, Jitao Sang, Tongwei Ren, Changsheng Xu

Algorithm 1: Co-Clustering with Bilateral Regularization Demonstration consists of search result organization and search (CCBR) result description. Based on the hashtag clustering results, we pro- Input: Hashtag-topic matrix H, topic-word matrix T, hashtag pose to organize the search results following a cluster-hashtag- co-occurrence matrix O, the number of column and item hierarchy (illustrated in Fig. 10). Within hashtag, the items are organized chronologically. Within cluster, hashtags are ranked row clusters Lcol , Lrow . directly by the cluster-hashtag weight p h C . For the ranking be- Output: Optimal column and row mapping γ ∗, ρ∗ . ( | l ) ( ) tween hashtag clusters, their importance to describe the query is 1 Initialize i = 0, randomly generate an initial γi , ρi ; while ( ) not converge & not reach maximum iteration do evaluated. Two factors are considered to calculate the cluster importance score. (1) The frequency that hashtags appear in the search 2 Step A: Update Minimum Bregman Information solution result collection. Since the search results are basically relevant to Hˆ with current ρ ,γ ; t t the query, it is reasonable that the cluster with more hashtag-annotated ( ˜ ) 3 Construct permuted H for row and column respectively; items in the search result collection should be ranked higher. (2) 4 Step B: Update topic cluster mappings γ ; The semantic relation between clusters. It is assumed that two = 5 for t 1 to Nt do hashtag clusters with similar semantic relation deserve close rank- = H H˜ + T T˜ 6 γi+1 t arg min EH t dϕ , EW t dϕ , ings to the query. ( ) 1,...,Lcol | [ ( )] | [ ( )] We rst introduce how to evaluate the semantic relation be- 7 end tween clusters. Given the cluster-hashtag weight p h Cl and the 8 Step C: Update hashtag cluster mappings ρ; ( | ) hashtag-topic distribution p zt h , it is easy to obtain the cluster- 9 for h = 1 to N do ( | ) h topic distribution p zt ;zt zall C as: = H H˜ + O O˜ l 10 ρi+1 h arg min ET h dϕ , E dϕ , ( ∈ | ) ( ) 1,...,Lr ow | [ ( )] [ ( )] t = t p z Cl Õ p h Cl p z h (10) 11 end ( | ) ( | )· ( | ) h Cl 12 i i + 1. ∈ ← The semantic relevance κ between clusters C and C is then cal- 13 end ij i j culated as: 14 γ ∗ = γi 1, ρ∗ = ρi 1. − − t t 2 zt zall p z Ci p z Cj κij = exp Í ∈ ( ( | )− ( | )) (11) (− 2σ 2 ) where the second term on the right of the equation denotes the row where σ is set to be the mean value of pairwise Euclidean distance T clustering over , EH t denote expectation according to marginal between clusters. | = = T˜ The importance score of each cluster η η1,η2,..., ηLr ow is distribution by xing T t, and the generation of is similar to [ ] that of H˜ . Note that since we are only interested to the row clus- then obtained by minimizing the following cost function: tering of T, the second term is actually modeled as one-sided Breg- Lr ow Lr ow = 1 1 2 + 2 man clustering problem [22], which is a special case of Bregman Q η Õ κij ηi ηj ψ Õ ηi Ui (12) ( ) ( √D − D ) ( − ) co-clustering by setting the number of column cluster as the same i=1, j=1 ii p jj i=1 all with the word number . Lr ow |W | where D = κ , ψ is weight parameter, and U records the As for the row clustering sub-problem, i.e., hashtag clustering, ii Íj=1 ij i times that hashtags within cluster C appear in the search result we make assumption that the hashtags co-occuring in the same i collection. The above problem can be solved by updating the im- item have very high probability to belong to the same subtopic. 7 O portance score at each iteration : To achieve this goal, we build a matrix Nh Nh with element Oij × denoting the times that the ith and jth hashtags co-occur in the t+1 1 t η( ) = η( )S + ψU (13) same item. To construct a unied formulation, similar to Eqn. (8), 1 + ψ ( ) we constrain that the optimal hashtag cluster mapping ρ is also = 1 2 1 2 = where S D− / WD− / , D Diag D11, D22, ..., DLr ow Lr ow , and consistent with the clustering over O. Therefore, the updated opti- = ( ) U U1,U2, ...,U r ow . After convergence, the derived η is used [ L ] ∗ mal function is as follows: to rank the hashtag clusters for organization. Noted that the above = H H˜ + O O˜ ρi+1 h arg min ET h dϕ , E dϕ , (9) cluster ranking strategy makes the noisy hashtag clusters rank ( ) 1,...,Lr ow | [ ( )] [ ( )] very low due to their low appearance frequency and irrelevance By replacing Eqn. (6)(7) with Eqn. (8)(9), the overall pseudo code with other clusters. of the proposed hashtag-topic co-clustering model is summarized For search result description, since the hashtag clusters are ex- in Algorithm 1. pected to correspond to subtopics, we are motivated to generate semantic description for each hashtag cluster. Specically, we cal- 4.3 Search Result Demonstration culate the cluster-word semantic distribution p w Cl by aggregat- ( | ) ing the cluster-topic distribution and topic-word distribution: After hashtag-topic clustering, for each query, we obtain Lrow hash- t t tag clusters C1,C2, ...,CLr ow , with each cluster consisting of NCl p w C = p z C p w z (14) { } ( | l ) Õ ( | l )· ( | ) hashtags from dierent OSNs. For each hashtag, a cluster-hashtag zt zall ∈ weight p h Cl is also derived to reect the importance of hashtag ( | ) 7 h to cluster Cl . Detail for derivation is available in [23] Hashtag-centric Immersive Search on Social Media MM’17, October 2017, Mountain View, CA USA

200 Avg.=0.070 0.25 0.55

150

0.20 0.50

0.15 100 0.45 NMI

Proportion 0.10 0.40

50 Number of Hashtags of Number

0.05 0.35

0.00 0 0.30 0 50 100 150 200 250 300 350 0 20 40 60 80 100 5 10 15 20 25 30 Queries Queries Lcol

(a)Number of Hashtags (b)Comparison on Dierent Lcol Figure 6: Vocabulary overlap proportion. Figure 7: Settings of Experiments on Hashtag-Topic Co- Clustering Platform Topic ranks,states,worldpolitics,meet ... Twitter published,newsletter, history,online,... 0.6 million,trump,election,votes,riches... CC Flickr 0.5 CC+RR nation,people,language,americanelection... CC+CR 0.4 trump,donald,live,rally,presidenttrump,... CCBR Youtube 0.3 country,children, feel,hold, citizens,... NMI 0.2

0.1 Table 3: Visualization of topics from dierent OSNs of query 0.0 0 20 40 60 80 100 "Election 2016". Queries

Figure 8: Experiment Performance Comparision

As a result, each hashtag cluster (subtopic) can be represented by the 5-10 words with the highest p w C . ( | l ) Table 4: Pearson correlation coecient with dierent settings. 5 EXPERIMENTS Based on the same dataset used for data analysis in Section 3, we re- Method CC CC+CR CC+RR CCBR ported in the following the experimental results of the three stages, NMI & #hashtag -0.689 -0.633 -0.643 -0.526 respectively.

5.1 Results of Representation Learning 5.2 Results of Hashtag Clustering As introduced in the solution, given query q, hLDA is conducted 5.2.1 Experimental Seing. Given a hashtag-topic distribution T Y F H, the goal of Stage 2 is to generate hashtag clusters C1,C2,...,CLr ow . on three document collections q , q , q over the vocabulary { } D D D Therefore, we use Normalized Mutual Information (NMI) [24] as space T,Y, F respectively. Empirical setting is used with α = 10 the evaluation metric. Given a cluster label assignment C on h = W = . 1 γ 1 and η 0 1. After hLDA, random walk is conducted with C1 = objects, H C1 = |= | P i log P i represents the entropy where α 0.5 to connect the respective vocabulary spaces to construct ( ) Íi 1 ( ) ( ( )) all P i = C1i h denotes the probability that an object picked ran- a unied vocabulary . Fig. 6 shows the proportion of over- ( ) | |/ W domly from C belongs to a cluster C . Then the Normalized Mu- lapped vocabulary overlap = T Y F to all . It 1 1i W W Ñ W Ñ W W tual Information between two label assignments C and C is de- is shown only about 7% vocabulary is shared between the three 1 2 ned as: OSNs, which validate the necessity for random walk-based vocab- C1 C2 P i, j ulary integration. | | | | P i, j log ( ) Íi=1 Íj=1 ( ) ( P i P j ) Table 3 illustrates some of the discovered leaf topics on dierent NMI C1,C2 = ( ) ( ) (15) ( ) H C1 H C2 OSNs for the query “Election 2016”. Each topic is represented by p ( ) ( ) the top probable words, where the words appeared in the original C2 where H C2 = |= | P j log P j , P j = C2j h and P i, j = vocabulary space is marked with black and the words extended ( ) Íj 1 ( ) ( ( )) ( ) | |/ ( ) C1i C2j h. by random walk from Twitter are highlighted with cyan, Flickr | ∩ |/ To make use of the NMI, the clusters of 100 randomly selected with blue and Youtube with red. Two observations derive: (1) The queries are manually labeled by 5 volunteers. Regarding the label- discovered topics have a wide coverage and the topics on dierent ing strategy, volunteers were under the guideline that they need OSNs have some words and themes in common. (2) Random walk to divide the hashtags into 5-9 clusters which determined by the connects between dierent vocabulary spaces and enhances the number of hashtags of the search results as shown in Fig. 7(a). Af- topic representation with cross-OSN words. ter that, we conducted the co-clustering process with the labeled truth as the number of hashtag clusters Lrow . MM’17, October 2017, Mountain View, CA USA Yuqi Gao, Jitao Sang, Tongwei Ren, Changsheng Xu

Regarding the number of topic cluster Lcol , we varied it from 1.0

5 to 30 with the step of 5 and reported the mean NMI in Fig. 7(b). 0.8 = It is shown the clustering accuracy increases till Lcol 20 and decreases afterwards. This indicates dividing leaf topics into 20 0.6

= NDCG 0.4 clusters achieves best result and we thus select Lcol 20 in our experiments. For other parameters of co-clustering, we follow the 0.2 NDCG@3 (Avg.=0.673) empirical setting from [21] and choose basis 2 and Squared Eu- NDCG@5 (Avg.=0.796) 0.0 C 0 20 40 60 80 100 clidean distance as dϕ . Queries 5.2.2 Experimental Results and Analysis. Performance compar- Figure 9: NDCG for dierent queries ison among dierent methods is shown in Fig.8. CC is the original Bregman co-clustering method, CC + RR is the co-clustering method with hashtag co-occurrence regularization, CC +CR is the original co-clustering method with intra-topic correlation regularization, and CCBR represents the proposed method with bilateral regularization. The examined queries are shown in ascend order of the obtained NMI by CCBR. It is shown the curve CCBR is above the other curves for most of the queries, demonstrating the advantages of bilateral regularization. Only considering intra-topic correlation or hashtag co-occurrence also improves the clustering performance over the Bregman co-clustering method to a certain extent. (a) Page for Clusters Regarding the performance variance among dierent queries, we calculated the Pearson correlation coecient between the obtained NMI and the number of hashtags for each query. The average results of dierent methods are shown in Table 4. The negative coecients indicate that the query with larger number of hashtags are likely to achieve a lower NMI.

5.3 Search Result Demonstration 5.3.1 Experimental Seing. We focus on the evaluation of hashtag cluster ranking at this stage. After hashtag-topic clustering, the search result demonstration is executed on the hashtag clusters with the parameter ψ = 0.5. We utilize a widely used metric, Normalized Discounted Cumulative Gain (NDCG) for evaluation. NDCG is dened as: (b) Page for hashtags and items k Figure 10: An example of user interface 1 2r j 1 NDCG@k = ( ) − (16) Z Õ log 1 + j j=1 ( ) More detailed results of the immersive search results are available at the web-based demo9. After issuing a event query, the re- where r is the relevance between the query and the ranked clus- (·) lated hashtags and information from Twitter, YouTube and Flickr ter which is calculated by Eqn. 12. are automatically collected and processed. Search results are organized following the cluster-hashtag-item hierachy, as illustrated in 5.3.2 Experimental Results and Analysis. To make use of NDCG, Fig. 10. In Fig. 10(a), the hashtag cluster is described by the words 5 volunteers were requested to vote for the top-5 appropriate clus- extracted according to Eqn. (14). When clicking certain hashtag ters for each of the examined 100 queries. The ground-truth is cluster, the assigned hashtags with related items within the cluster averaged among the votes from volunteers. We show NDCG@3 are displayed as in Fig. 10(b). and NDCG@5 in Fig. 9 for dierent queries. It can be observed that when examining top-5 clusters, the proposed rank method 6 CONCLUSION AND FUTURE WORK achieves a high average NDCG as 79.6%. Considering most queries have a ground-truth of 5-9 clusters, this high NDCG@5 indicates This study has positioned the problem of cross-OSN immersive the practicality of the solution. When only examining the rank-1 search. A preliminary hashtag-centric solution is introduced. Hash- cluster, the proposed rank method still achieves a satised perfor- tags are collected and exploited to organize the search results from mance with average NDCG@1=37%. 8. dierent OSNs to help understand social events in a coarse-to-ne scheme. There is very long way to go before real-world application, and this work can be extended along several directions in the near 8Since NDCG@1 is either 0 or 1, we did not provide the detail results for each query in Fig.9 9https://hashtagasbridge.github.io/Hashtag/ Hashtag-centric Immersive Search on Social Media MM’17, October 2017, Mountain View, CA USA future: (1) considering the time distribution of the collected hash- [21] Arindam Banerjee, Inderjit Dhillon, Joydeep Ghosh, Srujana Merugu, and Dhar- tags, to visualize and track the evolution of events among OSNs; (2) mendra S Modha. A generalized maximum entropy approach to bregman co- clustering and matrix approximation. Journal of Machine Learning Research, 8 exploring the social interaction potential of hashtag, e.g., analyzing (Aug):1919–1986, 2007. the users who adopt the hashtags and creating event-oriented user [22] Arindam Banerjee, Srujana Merugu, Inderjit S Dhillon, and Joydeep Ghosh. Clustering with bregman divergences. Journal of machine learning research, 6 channels to enrich the immersive search experience. (Oct):1705–1749, 2005. [23] Dan Lu, Xiaoxiao Liu, and Xueming Qian. Tag-based image search by social re-ranking. IEEE Transactions on Multimedia, 18(8):1628–1639, 2016. [24] Alexander Strehl and Joydeep Ghosh. Cluster ensembles—a knowledge reuse REFERENCES framework for combining multiple partitions. Journal of machine learning re- [1] Jitao Sang, Changsheng Xu, and Ramesh Jain. Social multimedia ming: From search, 3(Dec):583–617, 2002. special to general. In Multimedia (ISM), 2016 IEEE International Symposium on, pages 481–485. IEEE, 2016. [2] Lei Yang, Tao Sun, Ming Zhang, and Qiaozhu Mei. We know what@ you# tag: does the dual role aect hashtag adoption? In Proceedings of the 21st international conference on World Wide Web, pages 261–270. ACM, 2012. [3] Bernard J. Jansen and Amanda Spink. How are we searching the world wide web?: A comparison of nine search engine transaction logs. Inf. Process. Manage., 42(1):248–263, January 2006. [4] Yuan Liu, Tao Mei, and Xian-Sheng Hua. Crowdreranking: exploring multiple search engines for visual search reranking. In Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval, pages 500–507. ACM, 2009. [5] Jaime Teevan, Daniel Ramage, and Merredith Ringel Morris. # twittersearch: a comparison of microblog search and web search. In Proceedings of the fourth ACM international conference on Web search and data mining, pages 35–44. ACM, 2011. [6] Munmun De Choudhury, Meredith Ringel Morris, and Ryen W White. Seeking and sharing health information online: Comparing search engines and social media. In Proceedings of the 32nd annual ACM conference on Human factors in computing systems, pages 1365–1376. ACM, 2014. [7] Cheng-Kang Hsieh, Longqi Yang, Honghao Wei, Mor Naaman, and Deborah Estrin. Immersive recommendation: News and event recommendations using personal digital traces. In Proceedings of the 25th International Conference on World Wide Web, WWW ’16, pages 51–62, 2016. [8] Xitong Yang, Yuncheng Li, and Jiebo Luo. Pinterest board recommendation for twitter users. In Proceedings of the 23rd ACM international conference on Multimedia, pages 963–966. ACM, 2015. [9] Ming Yan, Jitao Sang, and Changsheng Xu. Unied youtube video recommendation via cross-network collaboration. In Proceedings of the 5th ACM on Inter- national Conference on Multimedia Retrieval, pages 19–26. ACM, 2015. [10] Suman Deb Roy, Tao Mei, Wenjun Zeng, and Shipeng Li. Socialtransfer: cross- domain transfer learning from social streams for media applications. In Pro- ceedings of the 20th ACM international conference on Multimedia, pages 649–658. ACM, 2012. [11] Ming Yan, Jitao Sang, and Changsheng Xu. Mining cross-network association for youtube video promotion. In Proceedings of the 22nd ACM international conference on Multimedia, pages 557–566. ACM, 2014. [12] Yuan Wang, JieLiu, JishiQu, Yalou Huang, Jimeng Chen, and Xia Feng. Hashtag graph based topic model for tweet mining. In Data Mining (ICDM), 2014 IEEE International Conference on, pages 1025–1030. IEEE, 2014. [13] Chen Xing, Yuan Wang, Jie Liu, Yalou Huang, and Wei-Ying Ma. Hashtag-based sub-event discovery using mutually generative lda in twitter. In Proceedings of the Thirtieth AAAI Conference on Articial Intelligence, AAAI’16, pages 2666– 2672, 2016. [14] Persi Diaconis and Ronald L Graham. Spearman’s footrule as a measure of disarray. Journal of the Royal Statistical Society. Series B (Methodological), pages 262–268, 1977. [15] Judit Bar-Ilan, Mazlita Mat-Hassan, and Mark Levene. Methods for comparing rankings of search engine results. Computer networks, 50(10):1448–1463, 2006. [16] David M. Blei, Thomas L. Griths, Michael I. Jordan, and Joshua B. Tenen- baum. Hierarchical topic models and the nested chinese restaurant process. In Advances in Neural Information Processing Systems 16 [Neural Information Pro- cessing Systems, NIPS 2003, December 8-13, 2003, Vancouver and Whistler, British Columbia, Canada], pages 17–24, 2003. [17] George A Miller. Wordnet: a lexical database for english. Communications of the ACM, 38(11):39–41, 1995. [18] Winston H Hsu, Lyndon S Kennedy, and Shih-Fu Chang. Video searchreranking through random walk over document-level context graph. In Proceedings of the 15th ACM international conference on Multimedia, pages 971–980. ACM, 2007. [19] Dong Liu, Xian-Sheng Hua, Linjun Yang, Meng Wang, and Hong-Jiang Zhang. Tag ranking. In Proceedings of the 18th international conference on World wide web, pages 351–360. ACM, 2009. [20] Zhengyu Deng, Jitao Sang, and Changsheng Xu. Personalized celebrity video search based on cross-space mining. In Pacic-Rim Conference on Multimedia, pages 455–463. Springer, 2012.