Index

A-Priori Algorithm, 194, 195, 201
Accessible page, 169
Ad-hoc query, 116
Adomavicius, G., 320
Advertising, 16, 97, 186, 261
Adwords, 270
Afrati, F.N., 53
Agglomerative clustering, see Hierarchical clustering
Aggregation, 24, 31, 35
Agrawal, R., 220
Alon, N., 144
Alon-Matias-Szegedy Algorithm, 128
Amplification, 83
Analytic query, 50
AND-construction, 83
Anderson, C., 320
Andoni, A., 110
Apache, 22, 54
Archive, 114
Ask, 174
Association rule, 187, 189
Associativity, 25
Attribute, 29
Auction, 273
Austern, M.H., 54
Authority, 174
Average, 126

B-tree, 260
Babcock, B., 144, 260
Babu, S., 144
Bag, 37, 59
Balance Algorithm, 273
Balazinska, M., 53
Band, 70
Bandwidth, 20
Basket, see Market basket, 184, 186, 187, 216
Bayes net, 4
BDMO Algorithm, 251
Beer and diapers, 188
Bell, R., 321
Bellkor’s Pragmatic Chaos, 290
Berkhin, P., 182
BFR Algorithm, 234, 237
Bid, 271, 273, 280, 281
BigTable, 53
Bik, A.J.C., 54
Biomarker, 187
Bipartite graph, 267
BIRCH Algorithm, 260
Birrell, A., 54
Bitmap, 202
Block, 11, 19, 162
Blog, 170
Bloom filter, 122, 200
Bloom, B.H., 144
Bohannon, P., 53
Bonferroni correction, 5
Bonferroni’s principle, 4, 5
Bookmark, 168
Borkar, V., 53
Bradley, P.S., 260
Brick-and-mortar retailer, 186, 288, 289
Brin, S., 182
Broad matching, 273
Broder, A.Z., 18, 110, 182
Bu, Y., 53
Bucket, 9, 119, 134, 138, 200, 251
Budget, 272, 279
Budiu, M., 54
Burrows, M., 53

Candidate itemset, 197, 210
Candidate pair, 70, 201, 203
Carey, M., 53
Centroid, 223, 226, 232, 235, 239
Chabbert, M., 320
Chandra, T., 53
Chang, F., 53
Characteristic matrix, 62
Charikar, M.S., 110
Chaudhuri, S., 110
Checkpoint, 43
Chen, M.-S., 220
Cholera, 3
Chronicle data model, 143
Chunk, 21, 210, 238
CineMatch, 317
Classifier, 298
Click stream, 115
Click-through rate, 265, 271
Cloud computing, 15
CloudStore, 22
Cluster computing, 20
Cluster tree, 246, 247
Clustera, 38, 52
Clustering, 3, 16, 221, 305
Clustroid, 226, 232
Collaborative filtering, 4, 17, 57, 261, 287, 301
Combiner, 25, 159, 161
Communication cost, 44
Commutativity, 25
Competitive ratio, 16, 266, 269, 274
Compressed set, 238
Compute node, 20
Computer game, 295
Computing cloud, see Cloud computing
Confidence, 187, 189
Content-based recommendation, 287, 292
Cooper, B.F., 53
Coordinates, 222
Cosine distance, 76, 86, 293, 298
Counting ones, 132, 251
Craig’s List, 262
Craswell, N., 285
CURE Algorithm, 242, 246
Currey, J., 54
Curse of dimensionality, 224, 248
Cyclic permutation, 68
Cylinder, 12
Czajkowski, G., 54

Darts, 122
Data mining, 1
Data stream, 16, 214, 250, 264
Data-stream-management system, 114
Database, 16
Datar, M., 111, 144, 260
Datar-Gionis-Indyk-Motwani Algorithm, 133
Dead end, 149, 152, 153, 175
Dean, J., 53
Decaying window, 139, 215
Decision tree, 298
Dehnert, J.C., 54
del.icio.us, 294
Deletion, 77
Dense matrix, 28
Density, 231, 233
DeWitt, D.J., 54
DFS, see Distributed file system
Diameter, 231, 233
Diapers and beer, 186
Difference, 30, 33, 38
Dimension table, 50
Dimensionality reduction, 308
Discard set, 238
Disk, 11, 191, 223, 246
Disk block, see Block
Display ad, 262, 263
Distance measure, 74, 221
Distinct elements, 124, 127
Distributed file system, 21, 184, 191
DMOZ, see Open Directory
Document, 56, 59, 187, 222, 281, 293, 294
Document frequency, see Inverse document frequency
Domain, 172
Dot product, 76
Dryad, 52
DryadLINQ, 53
Dup-elim task, 40

e, 12
Edit distance, 77, 80
Eigenvalue, 149
Eigenvector, 149
Elapsed communication cost, 46
Ensemble, 299
Entity resolution, 91, 92
Equijoin, 30
Erlingsson, I., 54
Ernst, M., 53
Ethernet, 20
Euclidean distance, 74, 89
Euclidean space, 74, 78, 222, 223, 226, 242
Exponentially decaying window, see Decaying window

Facebook, 168
Fact table, 50
Failure, 20, 26, 39, 40, 42
False negative, 70, 80, 209
False positive, 70, 80, 122, 209
Family of functions, 81
Fang, M., 220
Fayyad, U.M., 260
Feature, 246, 292–294
Fetterly, D., 54
Fikes, A., 53
File, 20, 21, 191, 209
Filtering, 121
Fingerprint, 94
First-price auction, 273
Fixedpoint, 84, 174
Flajolet, P., 144
Flajolet-Martin Algorithm, 125
Flow graph, 38
French, J.C., 260
Frequent bucket, 200, 202
Frequent itemset, 4, 184, 194, 197
Frequent pairs, 194, 195
Frequent-items table, 196
Friends relation, 48
Frieze, A.M., 110

Gaber, M.M., 18
Ganti, V., 110, 260
Garcia-Molina, H., 18, 182, 220, 260
Garofalakis, M., 144
Gaussian elimination, 150
Gehrke, J., 144, 260
Genre, 292, 304, 318
GFS, see Google file system
Ghemawat, S., 53, 54
Gibbons, P.B., 144
Gionis, A., 111, 144
Global minimum, 310
Gobioff, H., 54
Google, 146, 157, 270
Google file system, 22
Graph, 42
Greedy algorithm, 264, 265, 268, 272
GRGPF Algorithm, 246
Grouping, 24, 31, 35
Grouping attribute, 31
Gruber, R.E., 53
Guha, S., 260
Gunda, P.K., 54
Gyongi, Z., 182

Hadoop, 25, 54
Hadoop distributed file system, 22
Hamming distance, 78, 86
Harris, M., 317
Hash function, 9, 60, 65, 70, 119, 122, 125
Hash key, 9, 280
Hash table, 9, 10, 12, 193, 200, 202, 204, 280, 282
Haveliwala, T.H., 182
HDFS, see Hadoop distributed file system
Henzinger, M., 111
Hierarchical clustering, 223, 225, 243, 306
HITS, 174
Hive, 53, 54
Horn, H., 54
Howe, B., 53
Hsieh, W.C., 53
Hub, 174
Hyperlink-induced topic search, see HITS
Hyracks, 38

Identical documents, 99
IDF, see Inverse document frequency
Image, 115, 293, 294
IMDB, see Internet Movie Database
Imielinski, T., 220
Immediate subset, 212
Immorlica, N., 111
Important page, 146
Impression, 262
In-component, 151
Inaccessible page, 169
Index, 10
Indyk, P., 110, 111, 144
Initialize clusters, 235
Insertion, 77
Interest, 188
Internet Movie Database, 292, 317
Intersection, 30, 33, 38
Into Thin Air, 291
Inverse document frequency, see TF.IDF, 8
Inverted index, 146, 262
IP packet, 115
Isard, M., 54
Isolated component, 152
Item, 184, 186, 187, 288, 304, 305
Item profile, 292, 295
Itemset, 184, 192, 194

Jaccard distance, 74, 75, 82, 293
Jaccard similarity, 56, 64, 74, 169
Jacobsen, H.-A., 53
Jagadish, H.V., 144
Jahrer, M., 321
Join, see Natural join, see Multiway join, see Star join
Join task, 40

K-means, 234
Kalyanasundaram, B., 286
Kamm, D., 317
Karlin, A., 266
Kaushik, R., 110
Kautz, W.H., 144
Key component, 119
Key-value pair, 22–24
Keyword, 271, 299
Kleinberg, J.M., 182
Knuth, D.E., 18
Koren, Y., 320
Kosmix, 22
Krioukov, A., 54
Kumar, R., 18, 54, 182
Kumar, V., 18

Lagrangean multipliers, 48
LCS, see Longest common subsequence
Leiser, N., 54
Length, 128
Length indexing, 100
Leung, S.-T., 54
Lin, S., 111
Linden, G., 320
Linear equations, 150
Link, 29, 146, 160
Link matrix of the Web, 175
Link spam, 165, 169
Livny, M., 260
Local minimum, 310
Locality-sensitive family, 86
Locality-sensitive function, 81
Locality-sensitive hashing, 69, 80, 294
Logarithm, 12
Long tail, 186, 288, 289
Longest common subsequence, 77
LSH, see Locality-sensitive hashing

Machine learning, 2, 298
Maghoul, F., 18, 182
Mahalanobis distance, 241
Main memory, 191, 192, 200, 223
Malewicz, G., 54
Manber, U., 111
Manhattan distance, 75
Manning, C.P., 18
Many-many matching, 95
Many-many relationship, 184
Many-one matching, 95
Map task, 22, 23, 25
Map worker, 25, 27
Map-reduce, 15, 19, 22, 27, 159, 161, 210, 255
Market basket, 4, 16, 183, 184, 191
Markov process, 149, 152
Martin, G.N., 144
Master controller, 22, 24, 25
Master node, 22
Matching, 267
Matias, Y., 144
Matrix, 27, 35, 36, see Transition matrix of the Web, see Stochastic matrix, see Substochastic matrix, 159, 174, see Utility matrix, 308
Matthew effect, 14
Maximal itemset, 194
Maximal matching, 267
Mean, see Average
Median, 126
Mehta, A., 286
Merging clusters, 226, 229, 240, 244, 249, 253
Merton, P., 18
Minhashing, 63, 72, 76, 82, 294
Minicluster, 238
Minutiae, 94
Mirrokni, V.S., 111
Mirror page, 57
Mitzenmacher, M., 110
Moments, 127
Monotonicity, 194
Most-common elements, 139
Motwani, R., 111, 144, 220, 260
Multihash Algorithm, 204
Multiplication, 27, 35, 36, 159, 174
Multiset, see Bag
Multistage Algorithm, 202
Multiway join, 46
Mumick, I.S., 144
Mutation, 80

Name node, see Master node
Natural join, 30, 34, 35, 45
Naughton, J.F., 54
Navathe, S.B., 220
Near-neighbor search, see Locality-sensitive hashing
Negative border, 212
Netflix challenge, 2, 290, 317
Newspaper articles, 97, 281, 290
Non-Euclidean distance, 232, see Cosine distance, see Edit distance, see Hamming distance, see Jaccard distance
Non-Euclidean space, 246, 248
Norm, 74, 75
Normal distribution, 237
Normalization, 301, 303, 314

O’Callaghan, L., 260
Off-line algorithm, 264
Olston, C., 54
Omiecinski, E., 220
On-line advertising, see Advertising
On-line algorithm, 16, 264
On-line retailer, 186, 262, 288, 289
Open Directory, 166
OR-construction, 83
Orthogonal vectors, 224
Out-component, 151
Outlier, 223
Overfitting, 299
Overture, 271
Own pages, 170

Paepcke, A., 111
Page, L., 145, 182
PageRank, 3, 16, 27, 29, 40, 145, 147, 159
Pairs, see Frequent pairs
Park, J.S., 220
Pass, 192, 195, 202, 208
Paulson, E., 54
PCY Algorithm, 200, 202, 203
Pedersen, J., 182
Perfect matching, 267
Permutation, 63, 68
PIG, 53
Piotte, M., 320
Plagiarism, 57, 187
Pnuts, 53
Point, 221, 251
Point assignment, 223, 234
Polyzotis, A., 53
Position indexing, 102, 104
Positive integer, 138
Powell, A.L., 260
Power law, 13

Randomization, 208
Rarest-first order, 281
Rastogi, R., 144, 260
Rating, 288, 291
Recommendation system, 16, 287
Recursion, 40
Reduce task, 22, 24
Reduce worker, 25, 27
Reed, B., 54
Reina, C., 260
Relation, 29
Relational algebra, 29, 30
Replication, 21
Representation, 247
Representative point, 243
Representative sample, 119
Reservoir sampling, 144
Retained set, 238
Revenue, 272
Ripple-carry adder, 138
RMSE,