Result Disambiguation in Web People Search

Richard Berendsen, Bogomil Kovachev, Evangelia-Paraskevi Nastou, Maarten de Rijke and Wouter Weerkamp.

Setting the Scene

• Persistent interest in entity-oriented search

• Tutorial: from expert finding to entity search

• Key research areas

• Named entity disambiguation

• Entity linking

People Search

• W. Weerkamp, K. Balog, M. de Rijke, R. Berendsen, B. Kovachev, and E. Meij. People searching for people: Analysis of a people search engine log. SIGIR’11

Key problem - Person name disambiguation

• “Michiel Bakker”

Name                Profiles
Herman de Vries     11
Michiel Bakker      11
Nicole Bakker       11
Nynke de Vries      10
Mirjam de Vries     10
Marjan de Jong      10
Annemieke de Vries  10
Arjan Visser        10

(Weerkamp et al., 2011)

Idea: clustering for result organization

• Group search results by person

• WePS-2 clustering task

• User only has to find the right person

• Extract attributes from cluster

• WePS-2 attribute extraction task

J. Artiles, J. Gonzalo, and S. Sekine. WePS 2 evaluation campaign: Overview of the web people search clustering task. In 2nd Web People Search Evaluation Workshop (WePS 2009), 18th WWW Conference, 2009

Motivation

• Known problem: person name disambiguation

• New setting

• People search engine

• Many profiles

• Use evidence from log files

Agenda

• Search system

• Task definition

• Evaluation

• Results motivating idea

• Idea

• Results and discussion

Search system

Task definition

• Given

• Search results for a person name query

• Obtained from Google, Yahoo!, Bing, Hyves, Facebook, LinkedIn, Twitter and MySpace

• Merged by URL (see the sketch after this list)

• Produce

• Soft clustering of documents according to entities referred to
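
As an illustration of the ‘merged by URL’ step, a minimal Python sketch; the field names and the URL normalization are assumptions, not the actual system code:

from urllib.parse import urlsplit

def normalize(url):
    # Map equivalent URLs from different engines onto one key:
    # lowercase host, drop a leading "www." and a trailing slash.
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    return host + parts.path.rstrip("/")

def merge_by_url(result_lists):
    # result_lists: {engine_name: [{"url": ..., "snippet": ...}, ...]}
    # Returns one document per unique URL, remembering which engines returned it.
    merged = {}
    for engine, results in result_lists.items():
        for r in results:
            key = normalize(r["url"])
            doc = merged.setdefault(key, {"url": r["url"], "engines": set(), "snippets": []})
            doc["engines"].add(engine)
            doc["snippets"].append(r.get("snippet", ""))
    return list(merged.values())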

Task evaluation

• Follow WePS campaigns (2007, 2009, 2010)

• Extended B-cubed precision and recall for soft clustering

• E. Amigó, J. Gonzalo, J. Artiles, and F. Verdejo. A comparison of extrinsic clustering evaluation metrics based on formal constraints. Information Retrieval, 2009

• Precision and recall weighted equally in an F-measure (see the formula below)

• Macro-averaged: average of F-measures on each query
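
One way to write the equal-weight combination and the macro-average (our reading of “weighted equally”, in LaTeX):

F(q) = \frac{2 \cdot \mathrm{BCP}(q) \cdot \mathrm{BCR}(q)}{\mathrm{BCP}(q) + \mathrm{BCR}(q)}
\qquad\qquad
F = \frac{1}{|Q|} \sum_{q \in Q} F(q)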

Task evaluation, example query: John Smith

(Diagram: documents d1, d2, d3; ground-truth clusters for John Smith, sr and John Smith, jr, with the algorithm’s clustering shown alongside)

• Extended B-cubed precision of d1 = 1, because

• with d2: predicts they share one cluster, all predictions correct: 1

• with d3: no claims, so undefined.

Task evaluation, example (continued)


• Extended B-cubed recall of d1 = 0.5, because

• with d2: shares one cluster in ground truth, this is predicted: 1

• with d3: shares one cluster in ground truth, not predicted: 0
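
A small computational sketch of the extended B-cubed scores for the example above (Python). The cluster assignments encode our reading of the diagram, with d1 belonging to both ground-truth clusters; the “no claims” case is simply left out of the averages:

def multiplicity_scores(e, others, system, gold):
    # Extended B-cubed precision/recall contributions of one item e
    # (multiplicity precision/recall in the sense of Amigó et al., 2009).
    # system, gold: dict mapping item -> set of cluster labels.
    prec_vals, rec_vals = [], []
    for o in others:
        shared_sys = len(system[e] & system[o])
        shared_gold = len(gold[e] & gold[o])
        if shared_sys:    # the system claims e and o belong together
            prec_vals.append(min(shared_sys, shared_gold) / shared_sys)
        if shared_gold:   # the ground truth puts e and o together
            rec_vals.append(min(shared_sys, shared_gold) / shared_gold)
    precision = sum(prec_vals) / len(prec_vals) if prec_vals else None
    recall = sum(rec_vals) / len(rec_vals) if rec_vals else None
    return precision, recall

# Ground truth: d1 and d2 refer to John Smith, sr; d1 and d3 to John Smith, jr.
gold = {"d1": {"sr", "jr"}, "d2": {"sr"}, "d3": {"jr"}}
# Algorithm: clusters d1 with d2, leaves d3 on its own.
system = {"d1": {"c1"}, "d2": {"c1"}, "d3": {"c2"}}

print(multiplicity_scores("d1", ["d2", "d3"], system, gold))  # (1.0, 0.5), as on the slides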

Generalization to other queries

• From a small, but interesting population:

• ambiguous (> 3 Hyves profiles with an outclick)

• some click activity

• > 6 searches with clicks on >1 result pane

• 700 unique queries, out of >4M

• Pairwise randomization and t-tests, α = 0.05 / 0.001, N = 33
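
A minimal sketch of a paired (sign-flip) randomization test on per-query score differences, which is one common reading of “pairwise randomization”; all names are illustrative:

import random

def randomization_test(scores_a, scores_b, trials=100000, seed=42):
    # Two-sided paired randomization test on per-query score differences.
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs)) / len(diffs)
    hits = 0
    for _ in range(trials):
        # Randomly swap the two systems' scores for each query (sign flip).
        permuted = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(sum(permuted)) / len(permuted) >= observed:
            hits += 1
    return hits / trials  # p-value; compare against alpha = 0.05 or 0.001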

Robustness to weighting precision and recall

• Unanimous improvement ratio (UIR)

• (E. Amigó, J. Gonzalo, J. Artiles, and M. F. Verdejo. Combining evaluation metrics via the unanimous improvement ratio and its application to clustering tasks. JAIR, 2011)

• Let Q be the set of all queries, and P(X), R(X) the precision and recall of system X on a query

• UIR(X, Y) = ( |Q_{P(X) ≥ P(Y) ∧ R(X) ≥ R(Y)}| − |Q_{P(Y) ≥ P(X) ∧ R(Y) ≥ R(X)}| ) / |Q| (see the sketch below)

• Rule of thumb for datasets ‘like WePS’:

• if UIR(X, Y) > 0.25 → improvement is ‘robust’
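
The UIR definition above, written out as a short Python sketch over per-query precision and recall lists (aligned by query; the function name is ours):

def uir(prec_x, rec_x, prec_y, rec_y):
    # Unanimous improvement ratio of system X over system Y.
    n = len(prec_x)
    x_not_worse = sum(1 for px, rx, py, ry in zip(prec_x, rec_x, prec_y, rec_y)
                      if px >= py and rx >= ry)
    y_not_worse = sum(1 for px, rx, py, ry in zip(prec_x, rec_x, prec_y, rec_y)
                      if py >= px and ry >= rx)
    return (x_not_worse - y_not_worse) / n

# Rule of thumb from the slides: UIR(X, Y) > 0.25 counts as a robust improvement.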

Hierarchical agglomerative clustering (HAC)

• At WePS campaigns (2007, 2009, 2010), algorithms using HAC in the pipeline were successful

• We implemented the approach of C. Monz and W. Weerkamp. A comparison of retrieval-based hierarchical clustering approaches to person name disambiguation. In SIGIR’09, 2009

• Tf-idf vectors, cosine similarity

• Stop clustering if sim(C1, C2) < θ

• θ = 0.225, after exploring some values on WePS-2 data.
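
A minimal sketch of threshold-stopped HAC over tf-idf vectors with cosine similarity, using scikit-learn and scipy; this is a generic re-implementation of the recipe on this slide, not the authors’ code:

from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

def hac_clusters(documents, theta=0.225, method="single"):
    # documents: list of strings. Returns a cluster label per document.
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(documents)
    dist = pdist(tfidf.toarray(), metric="cosine")  # cosine distance = 1 - cosine similarity
    tree = linkage(dist, method=method)
    # Cutting the dendrogram at distance 1 - theta corresponds to stopping
    # the merging once sim(C1, C2) < theta.
    return fcluster(tree, t=1.0 - theta, criterion="distance")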

But, it breaks

• State of the art in WePS-2: F > 0.80

BCP BCR F

HAC single link 0.56 0.87 0.67

HAC centroid 0.72 0.71 0.69

Wait a minute, ...

• What happens if we only cluster URLs that were obtained from either

• web search engines (Google, Yahoo!, Bing), or:

• social media platforms?

• Answer:

• Results on web search engines are better, much better!

• Results on social media platforms are bad, really bad!

Obtained from web search engines

BCP BCR F

HAC single link 0.76 0.86 0.79

HAC centroid 0.89 0.70 0.75

• We evaluated by removing from the ground truth all search results that were obtained only through social media platform search engines.

Obtained from social media platforms

• Text-based approaches broke completely

• Extraction of meaningful text fields did not help

• Some intuitions

• people may have a profile on several platforms

• people generally do not have more than one profile per platform

• but they can be referred to from multiple profiles, especially when they are famous

Obtained from social media platforms

BCP BCR F
one in one baseline 1.00 0.86 0.92
cross links 0.83 0.88 0.84
co-clicks 0.99 0.87 0.91
clicked in same burst 0.98 0.86 0.91
picasa 1.00 0.86 0.92

• Clustering social media profiles is hard!

• One in one: significant and robust improvements over ‘cross links’

Back to the full dataset, an idea

• The results obtained through web search engines still have many social media profiles, so:

• Split documents into ‘social media profiles’ and ‘web documents’ by URL (see the sketch after this list)

• URL contains ‘hyves’, ‘facebook’, ‘linkedin’, ...

• A two step algorithm:

• 1. Cluster each dataset separately

• 2. Merge the two clusterings
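
A minimal sketch of the URL-based split referred to above (Python; the platform list is illustrative and incomplete):

SOCIAL_PLATFORMS = ("hyves", "facebook", "linkedin", "twitter", "myspace")  # illustrative

def is_social(url):
    # Heuristic from the slide: the URL contains a known platform name.
    return any(p in url.lower() for p in SOCIAL_PLATFORMS)

def split_documents(docs):
    # docs: list of dicts with a "url" field.
    social = [d for d in docs if is_social(d["url"])]
    web = [d for d in docs if not is_social(d["url"])]
    return social, web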

Step 1: Cluster each set separately

• On results obtained from web search engines, HAC single link worked best, use this

• On results obtained through social media platforms, the one in one baseline worked best, use this

Step 2: Merge the clusterings

• Two methods

• Dual baseline: the union of both clusterings

• Dual merge: an iterative algorithm that merges social media platform clusters with ‘web clusters’ (next slide)

Dual merge

• While there are social clusters:

• For each social cluster:

• Find the closest web cluster, penalizing similarity if the web cluster has social results already, with parameter w

• Put aside social clusters that cannot be merged with any web cluster (sim < τ)

• For each web cluster, if it is the closest to some social clusters, merge it with the closest of those (sketched in code below)
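
A sketch of the merge loop above in Python. The similarity function sim and the exact form of the penalty with w are assumptions (the slides do not spell them out); any cluster similarity, e.g. cosine between centroid tf-idf vectors, could be plugged in:

def dual_merge(web_clusters, social_clusters, sim, tau=0.5, w=1.0):
    # web_clusters, social_clusters: lists of clusters (each cluster a list of documents).
    # Assumes at least one web cluster. Returns the web clusters with social
    # clusters merged in, plus the social clusters that could not be attached.
    web = [list(c) for c in web_clusters]
    social = list(social_clusters)
    n_social_in = [0] * len(web)        # social clusters absorbed per web cluster
    unplaced = []

    while social:
        # For each social cluster, find its closest web cluster, penalizing web
        # clusters that already contain social results (parameter w).
        best, remaining = {}, []
        for s in social:
            scored = [(sim(s, c) / (1.0 + w * n_social_in[i]), i) for i, c in enumerate(web)]
            score, i = max(scored)
            if score < tau:
                unplaced.append(s)      # cannot be merged with any web cluster
            else:
                best.setdefault(i, []).append((score, s))
                remaining.append(s)
        social = remaining
        # Each web cluster that is the closest choice of some social clusters
        # merges with the single closest of those; the rest retry next round.
        merged_ids = set()
        for i, candidates in best.items():
            _, s = max(candidates, key=lambda t: t[0])
            web[i].extend(s)
            n_social_in[i] += 1
            merged_ids.add(id(s))
        social = [s for s in social if id(s) not in merged_ids]

    return web, unplaced + social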

Dual baseline vs not splitting data

• Large, significant improvement over both centroid and single link HAC without separately clustering social media profiles

• Robust improvement over centroid HAC without splitting

• Topic level comparison of F with HAC single link (see Fig. 2)

Fig. 2. (Left) Difference in Fβ=0.5 score per query between dual merge and HAC single link; a positive difference indicates a query where dual merge outperforms HAC single link (and vice versa). (Right) Improvement in Fβ=0.5 versus the ratio of social documents in the total result set for a given query.

From the paper: Our final set of experimental results concerns the dual strategies, for which results are listed in Table 4 (the table on the next slide). We observe that even with a naive merging strategy (dual baseline), we manage to achieve scores comparable to those achieved with HAC on WePS-2. Apparently, we are able to suppress the negative impact resulting from social media results. The dual baseline has large and significant improvements over all other methods on our full dataset. It improves robustly only over centroid HAC, however. With regard to single link HAC, the dual baseline improves precision on all topics, but it also loses a bit of recall on all topics, indicating that there is room for improvement. The more sophisticated merge method (dual merge) improves slightly but significantly and robustly on the dual baseline. It has higher recall on about six out of ten queries and never a lower precision: the intended effect of this method. Dual merge also has robust improvements over centroid HAC and the one in one baseline.

Analysis. In our analysis, we compare our dual strategy with the single link HAC baseline and we investigate why our “social” methods fail to improve over the one in one baseline. Dual merge vs. single link HAC: single link HAC is the best performing method on the results from web search engines, which is why we use it in our dual strategy approaches. Figure 2 (Left) compares the dual merge strategy and this baseline on a per-query basis. For almost all queries, there is a clear improvement when using the dual strategy. Our strategy of treating social media documents in a separate manner leads to large improvements, and we would expect a stronger improvement in cases where more social media documents are present. Figure 2 (Right) shows, however, that there is no clear correlation between the ratio of social media results returned for a query and the improvement after distinguishing between social and non-social search results. Returning to Figure 2 (Left), the query that shows the largest drop in performance, going from the HAC baseline to our dual strategy method, is the query with the highest ...

Dual baseline vs dual merge

BCP BCR F

Dual baseline 0.90 0.78 0.82

Dual merge 0.90 0.80 0.83

• Tiny improvement

• But significant, and robust

• A few combinations of τ and w were explored

Parameter sensitivity of dual merge

Table 5. Impact of parameters τ (Left) and w (Right) on the performance of the dual merge method.

(Left: varying τ, with w = 1.0)
τ      BCP  BCR  Fβ=0.5
0.225  0.77 0.82 0.78
0.500  0.90 0.80 0.83
0.775  0.90 0.78 0.82

(Right: varying w, with τ = 0.5)
w      BCP  BCR  Fβ=0.5
0.0    0.74 0.82 0.76
0.5    0.86 0.81 0.82
1.0    0.90 0.80 0.83
1.5    0.90 0.78 0.82
2.0    0.90 0.78 0.82

• If we set the minimal similarity threshold higher, precision increases, recall drops

• If we set the penalizing factor higher, precision increases, recall drops

From the paper: The minimal similarity threshold in HAC. Artiles et al. [4] observe that the performance of HAC is strongly dependent on the minimal similarity threshold used as a stopping criterion. Different topics have different optimal thresholds, and the authors provide an upper and lower bound for HAC by doing a parameter sweep and taking for each query the optimal value. The variety in optimal thresholds is such that learning an average optimal value on one dataset is no guarantee for success on another dataset. We try a different, query-dependent approach to estimating the similarity threshold, based on the observation that if a name is very ambiguous, we would require more evidence to cluster two documents with this name together. For example, it would not be unlikely to have two different John Smiths playing basketball in New York, but it would be unlikely to have two Jack Rumplestilskins doing so. A parameter sweep on the WePS-1 data shows that if a name is very ambiguous, we require a high similarity threshold, just as we expect. After visual inspection of the results, we perform a test run on the WePS-2 data with the following simple rule: if there are more than 500 LinkedIn profiles for a given query, use a similarity threshold of 0.360, otherwise use the default threshold of 0.225. Results of this experiment show a slight increase in precision, but an equal drop in recall, leading to the same F-measure as a run without query-dependent thresholding. Finding a more sophisticated way to use query characteristics to predict thresholding parameters is an interesting direction for future research.

Parameter sensitivity of our merging algorithm. The merging algorithm has a number of parameters. For HACsim we use single link clustering, as it performs best overall. Here, we explore the impact of the minimal similarity threshold (τ) and the parameter w, which regulates how strongly the similarity between a social and a non-social cluster decreases for each social cluster already present in the non-social cluster. Table 5 lists the performance for different values of τ (Left), while keeping w stable, and different values of w (Right), while keeping τ stable. We find that increasing τ leads to better precision, but decreasing recall. For w, we find a similar pattern of improving precision with higher w-values, at the cost of recall.
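
The query-dependent thresholding rule quoted above, as a one-line Python sketch (the input count is hypothetical; the cutoff of 500 and the thresholds 0.360 / 0.225 are the values from the text):

def similarity_threshold(linkedin_profile_count, default=0.225, ambiguous=0.360):
    # Stricter threshold for very ambiguous names (> 500 LinkedIn profiles).
    return ambiguous if linkedin_profile_count > 500 else default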

Conclusion

In this paper we studied the problem of disambiguating the search results of a people search engine. Our results show that the increasing availability of results retrieved from social media platforms causes state-of-the-art methods to break down.

Discussion

• Two different types of documents

• Social media profiles are hard to cluster

• Little text, much boilerplate

• Other evidence (links, clicks, faces) is also sparse

• Clustering them onto textually rich clusters is possible to some extent

• Dataset available! (just not the clicklogs) At: http://ilps.science.uva.nl/resources/ecir2012rdwps

Result disambiguation

• Web documents and profiles

• Treat them separately

• Then merge results
