
Statistical Ranking Problem
Tong Zhang
Statistics Department, Rutgers University

Ranking Problems
• Rank a set of items and display them to users in the corresponding order.
• Two issues: performance at the top, and dealing with a large search space.
  – web-page ranking
    ∗ rank pages for a query
    ∗ theoretical analysis with an error criterion focusing on the top
  – machine translation
    ∗ rank possible (English) translations for a given input (Chinese) sentence
    ∗ algorithm handling a large search space

1 Web-Search Problem
• User types a query; the search engine returns a result page:
  – selects from billions of pages.
  – assigns a score to each page, and returns pages ranked by the scores.
• Quality of a search engine:
  – relevance (whether returned pages are on topic and authoritative)
  – other issues
    ∗ presentation (diversity, perceived relevance, etc.)
    ∗ personalization (predict user-specific intention)
    ∗ coverage (size and quality of the index)
    ∗ freshness (whether contents are timely)
    ∗ responsiveness (how quickly the search engine responds to the query)

2 Relevance Ranking: Statistical Learning Formulation
• Training:
  – randomly select queries q, and web pages p for each query.
  – use editorial judgment to assign a relevance grade y(p, q).
  – construct a feature vector x(p, q) for each query/page pair.
  – learn a scoring function \hat{f}(x(p, q)) that preserves the order of y(p, q) for each q.
• Deployment:
  – query q comes in.
  – return pages p_1, ..., p_m in descending order of \hat{f}(x(p_i, q)).

3 Measuring Ranking Quality
• Given a scoring function \hat{f}, return the ordered page list p_1, ..., p_m for a query q.
  – only the order information is important.
  – should focus on the relevance of returned pages near the top.
• DCG (discounted cumulative gain) with decreasing weights c_i:
    DCG(\hat{f}, q) = \sum_{i=1}^m c_i r(p_i, q)
• c_i: reflects the effort (or likelihood) of the user clicking on the i-th position.
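The DCG sum above is straightforward to compute once the weights c_i are fixed; a minimal Python sketch follows, assuming the common convention c_i = 1/log2(i + 1), which the slides do not specify:

```python
import math

def dcg(relevances, k=None):
    """Discounted cumulative gain: sum of c_i * r(p_i, q) over ranked positions,
    assuming the common weight choice c_i = 1 / log2(i + 1)."""
    if k is not None:
        relevances = relevances[:k]  # quality focused on the top-k positions
    return sum(r / math.log2(i + 1) for i, r in enumerate(relevances, start=1))

# relevance grades r(p_i, q) of pages already sorted by the scoring function:
print(dcg([3, 2, 3, 0, 1]))
```

Because the weights decrease with position, swapping a relevant page toward the top always increases DCG, which is what "performance at the top" means here.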
4 Subset Ranking Model
• x ∈ X: feature (x(p, q) ∈ X)
• S ∈ S: subset of X ({x_1, ..., x_m} = {x(p, q) : p} ∈ S)
  – each subset corresponds to a fixed query q.
  – assume each subset has size m for convenience: m is large.
• y: quality grade of each x ∈ X (y(p, q)).
• scoring function f : X × S → R.
  – ranking function r_f(S) = {j_i}: ordering of S ∈ S based on the scoring function f.
• quality: DCG(f, S) = \sum_{i=1}^m c_i E_{y_{j_i} | (x_{j_i}, S)} [y_{j_i}].

5 Some Theoretical Questions
• Learnability:
  – subset size m is huge: do we need many samples (rows) to learn?
  – focusing quality on the top.
• Learning method:
  – regression.
  – pair-wise learning? other methods?
• Limited goal to address here:
  – can we learn ranking by using regression when m is large?
    ∗ massive data size (more than 20 billion)
    ∗ want to derive: error bounds independent of m.
  – what are some feasible algorithms and their statistical implications?

6 Bayes Optimal Scoring
• Given a set S ∈ S, for each x_j ∈ S, we define the Bayes scoring function as
    f_B(x_j, S) = E_{y_j | (x_j, S)} [y_j]
• The optimal Bayes ranking function r_{f_B} that maximizes DCG
  – is induced by f_B
  – returns a rank list J = [j_1, ..., j_m] in descending order of f_B(x_{j_i}, S).
  – is not necessarily unique (depending on the c_j)
• The function is subset dependent: requires appropriate result-set features.

7 Simple Regression
• Given subsets S_i = {x_{i,1}, ..., x_{i,m}} and corresponding relevance scores {y_{i,1}, ..., y_{i,m}}.
• We can estimate f_B(x_j, S) using regression in a family F:
    \hat{f} = \arg\min_{f ∈ F} \sum_{i=1}^n \sum_{j=1}^m (f(x_{i,j}, S_i) − y_{i,j})^2
• Problem: m is massive (> 20 billion)
  – computationally inefficient
  – statistically slow convergence
    ∗ ranking error bounded by O(\sqrt{m}) × root-mean-squared error.
• Solution: should emphasize estimation quality at the top.

8 Importance Weighted Regression
• Some samples are more important than other samples (focus on the top).
• A revised formulation: \hat{f} = \arg\min_{f ∈ F} \frac{1}{n} \sum_{i=1}^n L(f, S_i, {y_{i,j}}_j), with
    L(f, S, {y_j}_j) = \sum_{j=1}^m w(x_j, S) (f(x_j, S) − y_j)^2 + u \sup_j w'(x_j, S) (f(x_j, S) − δ(x_j, S))_+^2
• weight w: importance weighting focusing the regression error on the top
  – zero for irrelevant pages
• weight w': large for irrelevant pages
  – for which f(x_j, S) should stay below the threshold δ.
• importance weighting can be implemented through importance sampling.

9 Relationship of Regression and Ranking
Let Q(f) = E_S L(f, S), where
    L(f, S) = E_{{y_j}_j | S} L(f, S, {y_j}_j)
            = \sum_{j=1}^m w(x_j, S) E_{y_j | (x_j, S)} (f(x_j, S) − y_j)^2 + u \sup_j w'(x_j, S) (f(x_j, S) − δ(x_j, S))_+^2.
Theorem 1. Assume that c_i = 0 for all i > k. Under appropriate parameter choices with some constants u and γ, for all f:
    DCG(r_B) − DCG(r_f) ≤ C(γ, u) (Q(f) − \inf_{f'} Q(f'))^{1/2}.
Key point: focus on relevant documents at the top; \sum_j w(x_j, S) is much smaller than m.

10 Generalization Performance with Square Regularization
Consider the scoring function f_{\hat{β}}(x, S) = \hat{β}^T φ(x, S), with feature vector φ(x, S):
    \hat{β} = \arg\min_{β ∈ H} [ \frac{1}{n} \sum_{i=1}^n L(β, S_i, {y_{i,j}}_j) + λ β^T β ],   (1)
    L(β, S, {y_j}_j) = \sum_{j=1}^m w(x_j, S) (f_β(x_j, S) − y_j)^2 + u \sup_j w'(x_j, S) (f_β(x_j, S) − δ(x_j, S))_+^2.
Theorem 2. Let M = \sup_{x,S} ||φ(x, S)||_2 and W = \sup_S [ \sum_{x_j ∈ S} w(x_j, S) + u \sup_{x_j ∈ S} w'(x_j, S) ]. Let \hat{β} be the estimator defined in (1). Then we have
    DCG(r_B) − E_{{(S_i, {y_{i,j}}_j)}_{i=1}^n} DCG(r_{f_{\hat{β}}}) ≤ C(γ, u) [ (1 + \frac{WM}{\sqrt{2λn}})^2 \inf_{β ∈ H} (Q(f_β) + λ β^T β) − \inf_f Q(f) ]^{1/2}.

11 Interpretation of Results
• The result does not depend on m, but on the much smaller quantity
    W = \sup_S [ \sum_{x_j ∈ S} w(x_j, S) + u \sup_{x_j ∈ S} w'(x_j, S) ]
  – emphasizes relevant samples at the top: w is small for irrelevant documents.
  – a refined analysis can replace the sup over S by some p-norm over S.
• Can control generalization for the top portion of the rank list even with large m.
  – learning complexity does not depend on the majority of items near the bottom of the rank list.
  – the bottom items are usually easy to estimate.
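The importance-weighted loss L(f, S, {y_j}) above can be written down in a few lines; a minimal pure-Python sketch follows, where the concrete weights w, w', thresholds δ, and constant u in the example call are hypothetical placeholders — the slides leave them as parameter choices:

```python
def weighted_ranking_loss(scores, y, w, w_prime, delta, u):
    """L(f, S, {y_j}) = sum_j w_j (f_j - y_j)^2 + u * sup_j w'_j (f_j - delta_j)_+^2.

    First term: regression error weighted toward relevant pages (w_j = 0 for
    irrelevant ones).  Second term: penalizes the worst irrelevant page whose
    score f_j exceeds its threshold delta_j."""
    reg = sum(wj * (fj - yj) ** 2 for wj, fj, yj in zip(w, scores, y))
    top = max((wpj * max(fj - dj, 0.0) ** 2
               for wpj, fj, dj in zip(w_prime, scores, delta)),
              default=0.0)
    return reg + u * top

# three pages: two relevant (w = 1), one irrelevant (w = 0, w' = 1);
# the weights and thresholds here are illustrative, not from the slides
loss = weighted_ranking_loss(scores=[2.0, 1.0, 0.5], y=[2.0, 1.0, 0.0],
                             w=[1, 1, 0], w_prime=[0, 0, 1],
                             delta=[10.0, 10.0, 0.2], u=1.0)
```

In this toy call the relevant pages are fit exactly, so only the second term contributes: the irrelevant page's score 0.5 exceeds its threshold 0.2, and the loss is (0.5 − 0.2)² = 0.09.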
12 Key Points
• Ranking quality near the top is most important
  – statistical analysis to deal with this scenario
• Regression-based algorithm to handle a large search space
  – importance weighting of the regression terms
  – error bounds independent of the massive web size.

13 Statistical Translation and Algorithm Challenge
• Problem:
  – conditioned on a source sentence in one language,
  – generate a target sentence in another language.
• General approach:
  – scoring function: measures the quality of the translated sentence given the source sentence (similar to web search)
  – search strategy: effectively generate target-sentence candidates.
    ∗ search for the optimal score.
    ∗ structure used in the scoring function (through the block model).
• Main challenge: exponential growth of the search space

14 Graphical Illustration
[Figure: an Arabic source sentence (shown transliterated) aligned with its English translation "Israeli warplanes violate Lebanese airspace" as a sequence of blocks b1 ("Israeli"), b2 ("warplanes"), b3 ("violate"), b4 ("Lebanese airspace").]

15 Block Sequence Decoder
• Database: a set of possible translation blocks
  – e.g., block "a b" translates into potential block "z x y".
• Scoring function:
  – candidate translation: a block sequence (b_1, ..., b_n)
  – maps each sequence of blocks to a non-negative score
      s_w(b_1^n) = \sum_{i=1}^n w^T F(b_{i−1}, b_i, o).
• Input: source sentence.
• Translation:
  – generate block sequences consistent with the source sentence.
  – find the sequence with the largest score.

16 Decoder Training
• Given source/target sentence pairs {(S_i, T_i)}_{i=1,...,n}.
• Given a decoding scheme implementing: \hat{z}(w, S) = \arg\max_{z ∈ V(S)} s_w(z).
• Find a parameter w such that, on the training data,
  – \hat{z}(w, S_i) achieves a high BLEU score on average.
• Key difficulty: large search space (similar issues as in MLR)
• Traditional tuning.
• Our methods:
  – local training.
  – global training.

17 Traditional Scoring Function
• A bunch of local statistics gathered from the training data, and beyond.
  – different language models of the target P(T_i | T_{i−1}, T_{i−2}, ...).
    ∗ can depend on statistics beyond the training data.
    ∗ Google benefited significantly from huge language models.
  – orientation (swap) frequency.
  – block frequency.
  – block quality score:
    ∗ e.g., normalized in-block unigram translation probability: for
      (S, T) = ({s_1, ..., s_p}, {t_1, ..., t_q}),
      P(S | T) = \prod_j \frac{1}{n_j} \sum_i p(s_j | t_i).
• Linear combination of log-frequencies (five or six features):
  – s_w(b_1^n) = \sum_i \sum_j w_j \log f_j(b_i, b_{i−1}, ...)
• Tuning: hand tuning; gradient-descent adjustment to optimize the BLEU score

18 Large-Scale Training
• Motivation:
  – want: the ability to incorporate a large number of features.
  – require: an automated training method to optimize millions of parameters.
  – similar to the transition from Inktomi rank-function tuning to MLR.
• Challenges: the search space V(S) is too large. Need to break it down.
  – direct global training using the relevant-set model: treats the decoder as a black box.

19 Global Training of Decoder Parameters
• Treat the decoder as a black box implementing: \hat{z}(w, S) = \arg\max_{z ∈ V(S)} s_w(z).
• No need to know V(S).
• Generate the truth set as the block sequences with the K largest BLEU scores: V_K(S).
• Generate alternatives as a subset of V(S) that is "most relevant".
  – if w does well on the relevant alternatives, then it does well on the whole set V(S).
  – related to the active sampling procedure in MLR.
• Some related work in parsing exists.

20 Learning Method
• Try to minimize the following regularized empirical risk:
    \hat{w} = \arg\min_w [ \frac{1}{N} \sum_{i=1}^N Φ(w, V_K(S_i), V(S_i)) + λ w^T w ]
    Φ(w, V_K, V) = \frac{1}{K} \sum_{z ∈ V_K} \max_{z' ∈ V − V_K} ψ(w, z, z'),
    ψ(w, z, z') = φ(s_w(z), Bl(z), s_w(z'), Bl(z')).
• Relevant set: let ξ_i(w, z) = \max_{z' ∈ V(S_i) − V_K(S_i)} ψ(w, z, z'), and define
    V^{(r)}(\hat{w}, S_i) = {z' ∈ V(S_i) : ∃ z ∈ V_K(S_i), ξ_i(\hat{w}, z) ≠ 0 and ψ(\hat{w}, z, z') = ξ_i(\hat{w}, z)}.
• Key observation: V can be replaced by V^{(r)} without changing the solution.
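To make the objective on the last slide concrete, here is a pure-Python sketch of Φ(w, V_K, V) with a linear score s_w. The slides leave the penalty φ abstract, so the margin-style ψ below (BLEU gap times a hinge on the score difference) is an illustrative assumption, as are the tiny candidate sets:

```python
def score(w, features):
    """Linear block-sequence score s_w(z) = w . F(z)."""
    return sum(wj * fj for wj, fj in zip(w, features))

def phi(w, truth_set, full_set):
    """Phi(w, V_K, V) = (1/K) sum_{z in V_K} max_{z' in V - V_K} psi(w, z, z').

    psi is a hypothetical choice here: (Bl(z) - Bl(z')) times a hinge on the
    score gap, so it vanishes once every truth sequence out-scores every
    alternative by a margin of 1."""
    alternatives = [z for z in full_set if z not in truth_set]  # V - V_K
    total = 0.0
    for z in truth_set:
        total += max((z["bleu"] - zp["bleu"])
                     * max(0.0, 1.0 - (score(w, z["feat"]) - score(w, zp["feat"])))
                     for zp in alternatives)
    return total / len(truth_set)

# toy example: one truth sequence in V_K, one alternative in V - V_K
V_K = [{"bleu": 0.9, "feat": [1.0, 0.0]}]
V = V_K + [{"bleu": 0.2, "feat": [0.0, 1.0]}]
```

The relevant set V^(r) then collects, for each truth sequence z with ξ_i ≠ 0, exactly the alternatives z' attaining the inner max — which is why restricting the training objective to V^(r) leaves the solution unchanged.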