Index

A-Priori Algorithm, 194, 195, 201
Accessible page, 169
Ad-hoc query, 116
Adomavicius, G., 320
Advertising, 16, 97, 186, 261
Adwords, 270
Afrati, F.N., 53
Agglomerative clustering, see Hierarchical clustering
Aggregation, 24, 31, 35
Agrawal, R., 220
Alon, N., 144
Alon-Matias-Szegedy Algorithm, 128
Amplification, 83
Analytic query, 50
AND-construction, 83
Anderson, C., 320
Andoni, A., 110
Apache, 22, 54
Archive, 114
Ask, 174
Association rule, 187, 189
Associativity, 25
Attribute, 29
Auction, 273
Austern, M.H., 54
Authority, 174
Average, 126
B-tree, 260
Babcock, B., 144, 260
Babu, S., 144
Bag, 37, 59
Balance Algorithm, 273
Balazinska, M., 53
Band, 70
Bandwidth, 20
Basket, see Market basket, 184, 186, 187, 216
Bayes net, 4
BDMO Algorithm, 251
Beer and diapers, 188
Bell, R., 321
Bellkor’s Pragmatic Chaos, 290
Berkhin, P., 182
BFR Algorithm, 234, 237
Bid, 271, 273, 280, 281
BigTable, 53
Bik, A.J.C., 54
Biomarker, 187
Bipartite graph, 267
BIRCH Algorithm, 260
Birrell, A., 54
Bitmap, 202
Block, 11, 19, 162
Blog, 170
Bloom filter, 122, 200
Bloom, B.H., 144
Bohannon, P., 53
Bonferroni correction, 5
Bonferroni’s principle, 4, 5
Bookmark, 168
Borkar, V., 53
Bradley, P.S., 260
Brick-and-mortar retailer, 186, 288, 289
Brin, S., 182
Broad matching, 273
Broder, A.Z., 18, 110, 182
Bu, Y., 53
Bucket, 9, 119, 134, 138, 200, 251
Budget, 272, 279


Budiu, M., 54
Burrows, M., 53
Candidate itemset, 197, 210
Candidate pair, 70, 201, 203
Carey, M., 53
Centroid, 223, 226, 232, 235, 239
Chabbert, M., 320
Chandra, T., 53
Chang, F., 53
Characteristic matrix, 62
Charikar, M.S., 110
Chaudhuri, S., 110
Checkpoint, 43
Chen, M.-S., 220
Cholera, 3
Chronicle data model, 143
Chunk, 21, 210, 238
CineMatch, 317
Classifier, 298
Click stream, 115
Click-through rate, 265, 271
Cloud computing, 15
CloudStore, 22
Cluster computing, 20
Cluster tree, 246, 247
Clustera, 38, 52
Clustering, 3, 16, 221, 305
Clustroid, 226, 232
Collaborative filtering, 4, 17, 57, 261, 287, 301
Combiner, 25, 159, 161
Communication cost, 44
Commutativity, 25
Competitive ratio, 16, 266, 269, 274
Compressed set, 238
Compute node, 20
Computer game, 295
Computing cloud, see Cloud computing
Confidence, 187, 189
Content-based recommendation, 287, 292
Cooper, B.F., 53
Coordinates, 222
Cosine distance, 76, 86, 293, 298
Counting ones, 132, 251
Craig’s List, 262
Craswell, N., 285
CURE Algorithm, 242, 246
Currey, J., 54
Curse of dimensionality, 224, 248
Cyclic permutation, 68
Cylinder, 12
Czajkowski, G., 54
Darts, 122
, 1
Data stream, 16, 214, 250, 264
Data-stream-management system, 114
, 16
Datar, M., 111, 144, 260
Datar-Gionis-Indyk-Motwani Algorithm, 133
Dead end, 149, 152, 153, 175
Dean, J., 53
Decaying window, 139, 215
Decision tree, 298
Dehnert, J.C., 54
del.icio.us, 294
Deletion, 77
Dense matrix, 28
Density, 231, 233
DeWitt, D.J., 54
DFS, see Distributed file system
Diameter, 231, 233
Diapers and beer, 186
Difference, 30, 33, 38
Dimension table, 50
, 308
Discard set, 238
Disk, 11, 191, 223, 246
Disk block, see Block
Display ad, 262, 263
Distance measure, 74, 221
Distinct elements, 124, 127
Distributed file system, 21, 184, 191
DMOZ, see Open directory
Document, 56, 59, 187, 222, 281, 293, 294

Document frequency, see Inverse document frequency
Domain, 172
Dot product, 76
Dryad, 52
DryadLINQ, 53
Dup-elim task, 40
e, 12
Edit distance, 77, 80
Eigenvalue, 149
Eigenvector, 149
Elapsed communication cost, 46
Ensemble, 299
Entity resolution, 91, 92
Equijoin, 30
Erlingsson, I., 54
Ernst, M., 53
Ethernet, 20
Euclidean distance, 74, 89
Euclidean space, 74, 78, 222, 223, 226, 242
Exponentially decaying window, see Decaying window
Facebook, 168
Fact table, 50
Failure, 20, 26, 39, 40, 42
False negative, 70, 80, 209
False positive, 70, 80, 122, 209
Family of functions, 81
Fang, M., 220
Fayyad, U.M., 260
Feature, 246, 292–294
Fetterly, D., 54
Fikes, A., 53
File, 20, 21, 191, 209
Filtering, 121
Fingerprint, 94
First-price auction, 273
Fixedpoint, 84, 174
Flajolet, P., 144
Flajolet-Martin Algorithm, 125
Flow graph, 38
French, J.C., 260
Frequent bucket, 200, 202
Frequent itemset, 4, 184, 194, 197
Frequent pairs, 194, 195
Frequent-items table, 196
Friends relation, 48
Frieze, A.M., 110
Gaber, M.M., 18
Ganti, V., 110, 260
Garcia-Molina, H., 18, 182, 220, 260
Garofalakis, M., 144
Gaussian elimination, 150
Gehrke, J., 144, 260
Genre, 292, 304, 318
GFS, see Google file system
Ghemawat, S., 53, 54
Gibbons, P.B., 144
Gionis, A., 111, 144
Global minimum, 310
Gobioff, H., 54
Google, 146, 157, 270
Google file system, 22
Graph, 42
Greedy algorithm, 264, 265, 268, 272
GRGPF Algorithm, 246
Grouping, 24, 31, 35
Grouping attribute, 31
Gruber, R.E., 53
Guha, S., 260
Gunda, P.K., 54
Gyongi, Z., 182
Hadoop, 25, 54
Hadoop distributed file system, 22
Hamming distance, 78, 86
Harris, M., 317
Hash function, 9, 60, 65, 70, 119, 122, 125
Hash key, 9, 280
Hash table, 9, 10, 12, 193, 200, 202, 204, 280, 282
Haveliwala, T.H., 182
HDFS, see Hadoop distributed file system
Henzinger, M., 111

Hierarchical clustering, 223, 225, 243, 306
HITS, 174
Hive, 53, 54
Horn, H., 54
Howe, B., 53
Hsieh, W.C., 53
Hub, 174
Hyperlink-induced topic search, see HITS
Hyracks, 38
Identical documents, 99
IDF, see Inverse document frequency
Image, 115, 293, 294
IMDB, see Internet Movie Database
Imielinski, T., 220
Immediate subset, 212
Immorlica, N., 111
Important page, 146
Impression, 262
In-component, 151
Inaccessible page, 169
Index, 10
Indyk, P., 110, 111, 144
Initialize clusters, 235
Insertion, 77
Interest, 188
Internet Movie Database, 292, 317
Intersection, 30, 33, 38
Into Thin Air, 291
Inverse document frequency, see TF.IDF, 8
Inverted index, 146, 262
IP packet, 115
Isard, M., 54
Isolated component, 152
Item, 184, 186, 187, 288, 304, 305
Item profile, 292, 295
Itemset, 184, 192, 194
Jaccard distance, 74, 75, 82, 293
Jaccard similarity, 56, 64, 74, 169
Jacobsen, H.-A., 53
Jagadish, H.V., 144
Jahrer, M., 321
Join, see Natural join, see Multiway join, see Star join
Join task, 40
K-means, 234
Kalyanasundaram, B., 286
Kamm, D., 317
Karlin, A., 266
Kaushik, R., 110
Kautz, W.H., 144
Key component, 119
Key-value pair, 22–24
Keyword, 271, 299
Kleinberg, J.M., 182
Knuth, D.E., 18
Koren, Y., 320
Kosmix, 22
Krioukov, A., 54
Kumar, R., 18, 54, 182
Kumar, V., 18
Lagrangean multipliers, 48
LCS, see Longest common subsequence
Leiser, N., 54
Length, 128
Length indexing, 100
Leung, S.-T., 54
Lin, S., 111
Linden, G., 320
Linear equations, 150
Link, 29, 146, 160
Link matrix of the Web, 175
Link spam, 165, 169
Livny, M., 260
Local minimum, 310
Locality-sensitive family, 86
Locality-sensitive function, 81
Locality-sensitive hashing, 69, 80, 294
Logarithm, 12
Long tail, 186, 288, 289
Longest common subsequence, 77
LSH, see Locality-sensitive hashing

Machine learning, 2, 298
Maghoul, F., 18, 182
Mahalanobis distance, 241
Main memory, 191, 192, 200, 223
Malewicz, G., 54
Manber, U., 111
Manhattan distance, 75
Manning, C.P., 18
Many-many matching, 95
Many-many relationship, 184
Many-one matching, 95
Map task, 22, 23, 25
Map worker, 25, 27
Map-reduce, 15, 19, 22, 27, 159, 161, 210, 255
Market basket, 4, 16, 183, 184, 191
Markov process, 149, 152
Martin, G.N., 144
Master controller, 22, 24, 25
Master node, 22
Matching, 267
Matias, Y., 144
Matrix, 27, 35, 36, see Transition matrix of the Web, see Stochastic matrix, see Substochastic matrix, 159, 174, see Utility matrix, 308
Matthew effect, 14
Maximal itemset, 194
Maximal matching, 267
Mean, see Average
Median, 126
Mehta, A., 286
Merging clusters, 226, 229, 240, 244, 249, 253
Merton, P., 18
Minhashing, 63, 72, 76, 82, 294
Minicluster, 238
Minutiae, 94
Mirrokni, V.S., 111
Mirror page, 57
Mitzenmacher, M., 110
Moments, 127
Monotonicity, 194
Most-common elements, 139
Motwani, R., 111, 144, 220, 260
Multihash Algorithm, 204
Multiplication, 27, 35, 36, 159, 174
Multiset, see Bag
Multistage Algorithm, 202
Multiway join, 46
Mumick, I.S., 144
Mutation, 80
Name node, see Master node
Natural join, 30, 34, 35, 45
Naughton, J.F., 54
Navathe, S.B., 220
Near-neighbor search, see Locality-sensitive hashing
Negative border, 212
Netflix challenge, 2, 290, 317
Newspaper articles, 97, 281, 290
Non-Euclidean distance, 232, see Cosine distance, see Edit distance, see Hamming distance, see Jaccard distance
Non-Euclidean space, 246, 248
Norm, 74, 75
Normal distribution, 237
Normalization, 301, 303, 314
O’Callaghan, L., 260
Off-line algorithm, 264
Olston, C., 54
Omiecinski, E., 220
On-line advertising, see Advertising
On-line algorithm, 16, 264
On-line retailer, 186, 262, 288, 289
Open Directory, 166
OR-construction, 83
Orthogonal vectors, 224
Out-component, 151
, 223
Overfitting, 299
Overture, 271
Own pages, 170
Paepcke, A., 111
Page, L., 145, 182

PageRank, 3, 16, 27, 29, 40, 145, 147, 159
Pairs, see Frequent pairs
Park, J.S., 220
Pass, 192, 195, 202, 208
Paulson, E., 54
PCY Algorithm, 200, 202, 203
Pedersen, J., 182
Perfect matching, 267
Permutation, 63, 68
PIG, 53
Piotte, M., 320
Plagiarism, 57, 187
Pnuts, 53
Point, 221, 251
Point assignment, 223, 234
Polyzotis, A., 53
Position indexing, 102, 104
Positive integer, 138
Powell, A.L., 260
Power law, 13
Predicate, 298
Prefix indexing, 101, 102, 104
Pregel, 42
Principal eigenvector, 149
Priority queue, 229
Privacy, 264
Probe string, 102
Profile, see Item profile, see User profile
Projection, 30, 32
Pruhs, K.R., 286
Puz, N., 53
Query, 116, 135, 255
R-tree, 260
Rack, 20
Radius, 231, 233
Raghavan, P., 18, 182
Rajagopalan, S., 18, 182
Ramakrishnan, R., 53, 260
Ramsey, W., 285
Random hyperplanes, 86, 294
Random surfer, 146, 147, 152, 166
Randomization, 208
Rarest-first order, 281
Rastogi, R., 144, 260
Rating, 288, 291
Recommendation system, 16, 287
Recursion, 40
Reduce task, 22, 24
Reduce worker, 25, 27
Reed, B., 54
Reina, C., 260
Relation, 29
Relational algebra, 29, 30
Replication, 21
Representation, 247
Representative point, 243
Representative sample, 119
Reservoir , 144
Retained set, 238
Revenue, 272
Ripple-carry adder, 138
RMSE, see Root-mean-square error
Robinson, E., 54
Root-mean-square error, 290, 309
Rounding data, 303
Row, see Tuple
Rowsum, 246
Royalty, J., 54
S-curve, 71, 80
Saberi, A., 286
Sample, 208, 212, 214, 217, 235, 243, 247
Sampling, 118, 132
Savasere, A., 220
SCC, see Strongly connected component
Schema, 29
Schutze, H., 18
Score, 93
Search ad, 262
Search engine, 157, 173
Search query, 115, 146, 168, 262, 280
Second-price auction, 273
Secondary storage, see Disk

Selection, 30, 32
Sensor, 115
Set, 62, 100, see Itemset
Set difference, see Difference
Shankar, S., 54
Shim, K., 260
Shingle, 59, 72, 97
Shivakumar, N., 220
Shopping cart, 185
Shortest paths, 42
Siddharth, J., 111
Signature, 62, 65, 72
Signature matrix, 65, 70
Silberschatz, A., 144
Silberstein, A., 53
Similarity, 4, 15, 56, 183, 294, 301
Singleton, R.C., 144
Singular-value decomposition, 308
Sketch, 88
Sliding window, 116, 132, 139, 251
Smith, B., 320
SON Algorithm, 210
Space, 74, 221
Spam, see Term spam, see Link spam
Spam farm, 169, 172
Spam mass, 172, 173
Sparse matrix, 28, 63, 64, 159, 160, 288
Spider trap, 152, 155, 175
Splitting clusters, 249
SQL, 29
Srikant, R., 220
Srivastava, U., 53, 54
Standard deviation, 239, 241
Standing query, 116
Star join, 50
Stata, R., 18, 182
Statistical model, 1
Status, 281
Steinbach, M., 18
Stochastic matrix, 149
Stop clustering, 227, 231, 233
Stop words, 8, 61, 97, 187, 293
Stream, see Data stream
String, 100
Striping, 28, 159, 161
Strongly connected component, 151
Strongly connected graph, 149
Substochastic matrix, 152
Suffix length, 104
Summarization, 3
Summation, 138
Superimposed code, see Bloom filter, 143
Supermarket, 185, 208
Superstep, 43
Support, 184, 209, 210, 212, 214
Supporting page, 170
Surprise number, 128
SVD, see Singular-value decomposition
Swami, A., 220
Szegedy, M., 144
Tag, 294
Tail length, 125
Tan, P.-N., 18
Target page, 170
Task, 21
Taxation, 152, 155, 170, 175
Taylor expansion, 12
Taylor, M., 285
Teleport set, 166, 167, 172
Teleportation, 156
Tendril, 151
Term, 146
Term frequency, see TF.IDF, 8
Term spam, 146, 169
TF, see Term frequency
TF.IDF, 7, 8, 293
Theobald, M., 111
Thrashing, 161, 200
Threshold, 71, 141, 184, 210, 214
TIA, see Total Information Awareness
Timestamp, 133, 252
Toivonen’s Algorithm, 211
Toivonen, H., 220
Tomkins, A., 18, 54, 182
Topic-sensitive PageRank, 165, 172

Toscher, A., 321
Total Information Awareness, 5
Touching the Void, 291
Transaction, see Basket
Transition matrix of the Web, 148, 159, 160, 162
Transitive closure, 40
Transpose, 175
Transposition, 80
Tree, 228, 246, 247, see Decision tree
Triangle inequality, 74
Triangular matrix, 192, 202
Triples method, 193, 202
TrustRank, 172
Trustworthy page, 172
Tube, 152
Tuple, 29
Tuzhilin, A., 320
Twitter, 281
Ullman, J.D., 18, 53, 54, 220, 260
Union, 30, 33, 37
Universal set, 100
User, 288, 304, 305
User profile, 296
Utility matrix, 288, 291, 308
UV-decomposition, 308, 317
Variable, 128
Vazirani, U., 286
Vazirani, V., 286
Vector, 27, 74, 78, 149, 159, 174, 175, 222
Vitter, J., 144
von Ahn, L., 295, 321
Wallach, D.A., 53
Wang, J., 317
Wang, W., 111
Weaver, D., 53
Web structure, 151
Weiner, J., 18, 182
Whizbang Labs, 2
Widom, J., 18, 54, 144, 260
Window, see Sliding window, see Decaying window
Windows, 12
Word, 187, 222, 293
Word count, 23
Worker process, 25
Workflow, 38, 40, 44
Working store, 114
Xiao, C., 111
Yahoo, 271, 294
Yerneni, R., 53
York, J., 320
Yu, J.X., 111
Yu, P.S., 220
Yu, Y., 54
Zhang, T., 260
Zipf’s law, 15, see Power law
Zoeter, O., 285