Big Data Benchmarking Workshop Publications

Benchmarking Datacenter and Big Data Systems

Wanling Gao, Zhen Jia, Lei Wang, Yuqing Zhu, Chunjie Luo, Yingjie Shi, Yongqiang He, Shiming Gong, Xiaona Li, Shujie Zhang, Bizhu Qiu, Lixin Zhang, Jianfeng Zhan
Institute of Computing Technology
http://prof.ict.ac.cn/ICTBench

Acknowledgements

This work is supported by the Chinese 973 project (Grant No. 2011CB302502), the Hi-Tech Research and Development (863) Program of China (Grant No. 2011AA01A203, No. 2013AA01A213), the NSFC project (Grant No. 60933003, No. 61202075), the BNSF project (Grant No. 4133081), and Huawei funding.

Publications

  • BigDataBench: a Big Data Benchmark Suite from Web Search Engines. Wanling Gao et al. The Third Workshop on Architectures and Systems for Big Data (ASBD 2013), in conjunction with ISCA 2013.
  • Characterizing Data Analysis Workloads in Data Centers. Zhen Jia et al. 2013 IEEE International Symposium on Workload Characterization (IISWC-2013).
  • Characterizing OS behavior of Scale-out Data Center Workloads. Chen Zheng et al. Seventh Annual Workshop on the Interaction amongst Virtualization, Operating Systems and Computer Architecture (WIVOSCA 2013), in conjunction with ISCA 2013.
  • Characterization of Real Workloads of Web Search Engines. Huafeng Xi et al. 2011 IEEE International Symposium on Workload Characterization (IISWC-2011).
  • The Implications of Diverse Applications and Scalable Data Sets in Benchmarking Big Data Systems. Zhen Jia et al. Second Workshop on Big Data Benchmarking (WBDB 2012 India) & Lecture Notes in Computer Science (LNCS).
  • CloudRank-D: Benchmarking and Ranking Cloud Computing Systems for Data Processing Applications. Chunjie Luo et al. Front. Comput. Sci. (FCS) 2012, 6(4): 347–362.

Content

  • Background and Motivation
  • Our ICTBench
  • Case studies

Question One

Gap between Industry and Academia
The distance keeps growing in:
  • Code
  • Data sets

Question Two

Different benchmark requirements
Architecture communities
  • Simulation is very slow
  • Small data and code sets
System communities
  • Large-scale deployment is valuable
Users
  • "There are three kinds of lies: lies, damn lies, and benchmarks"
  • Real-world applications

Data Centers in the World

[Figure not recovered. Source: Emerson, December 2011]
http://www.emersonnetworkpower.com/en-US/About/NewsRoom/Pages/2011DataCenterState.aspx

State-of-Practice Benchmark Suites

  • SPEC CPU
  • SPEC Web
  • HPCC
  • PARSEC
  • TPC-C
  • Gridmix
  • YCSB

Current Benchmarks

  Field        Benchmark name
  CPU          SPEC CPU
  Web server   SPEC Web
  CMP          PARSEC
  OLTP         TPC-C
  OLAP         TPC-DS
  HPC          HPCC, Linpack
  NoSQL        YCSB
  Network      httperf
  …            …

Why a New Benchmark Suite for Datacenter Computing

No benchmark suite covers the diversity of data center workloads.
State of the art: CloudSuite
  • Includes only 6 applications, chosen according to their popularity

Why a New Benchmark Suite (Cont'd)

Memory Level Parallelism (MLP): the number of simultaneously outstanding cache misses
[Figure: MLP of CloudSuite compared with our benchmark suite, DCBench]

Why a New Benchmark Suite (Cont'd)

Scale-out performance
[Figure: speedup versus number of working nodes (1, 4, 8). Labels: DCBench, CloudSuite data analysis benchmark. Series: sort, grep, wordcount, svm, kmeans, fkmeans, all-pairs, Bayes, HMM]

ICTBench Project

Benchmarking
  • Foundation of research
  • Bridge
ICTBench: three benchmark suites
  • DCBench: architecture (application, OS, and VM execution)
  • BigDataBench: system (large-scale big data applications)
  • CloudRank: cloud benchmarks (distributed management)
Project homepage: http://prof.ict.ac.cn/ICTBench

DCBench

DCBench: typical data center workloads
  • Different from scientific computing: FLOPS
  • Covers applications in important domains: search engine, electronic commerce, etc.
  • Each benchmark = a single application
Purposes
  • Architecture and system (small-to-medium scale) research

BigDataBench

Characterizing big data applications
  • Not including data-intensive supercomputing
  • Synthetic data sets ranging from 10G to PB scale
  • Each benchmark = a single big application
Purposes
  • Large-scale system and architecture research
An incremental approach
  • Release a start-up benchmark suite: workloads in the search engine system
  • Other important domains to follow

CloudRank

Cloud computing
  • Elastic resource management
  • Consolidating different workloads
Cloud benchmarks
  • Each benchmark = a group of consolidated data center workloads
  • Three benchmarks: services, data processing, desktop
Purposes
  • Capacity planning, system evaluation, and research
  • Users can customize their benchmarks

Benchmarking Methodology

  • Decide on and rank the main application domains according to a publicly available metric, e.g. page views and daily visitors
  • Single out the main applications from the main application domains

Top Sites on the Web

[Pie chart: Top Sites on the Web. Legend: Search Engine, Social Network, Electronic Commerce, Media Streaming, Others. Slice labels as extracted: 15%, 5%, 40%, 15%, 25%]
More details at http://www.alexa.com/topsites/global;0

Algorithms in Top Sites: Search Engine

Algorithms used in search:
  • PageRank
  • Graph mining
  • Segmentation
  • Feature reduction
  • Grep
  • Statistical counting
  • Vector calculation
  • Sort
  • Recommendation
  • …

Our Practice

Building a semantic search engine (Chinese): ProfSearch
  • Searches for scientists or professionals
  • 267083 researchers across 260 universities and institutes
  • http://prof.ict.ac.cn/

ProfSearch

Crawler workloads
  • Scrapy
Analysis workloads
  • SVM, Naïve Bayes, K-means, HMM, CRFs, LSA, LDA
Store and management workloads
  • HDFS: storing unstructured web pages
  • Hive: storing semi-structured intermediate data
  • MySQL: storing structured data extracted from the web
Web service workloads
  • Sphinx

Algorithms in Top Sites: Social Network

Algorithms used in social networks:
  • Recommendation
  • Clustering
  • Classification
  • Graph mining
  • Grep
  • Feature reduction
  • Statistical counting
  • Vector calculation
  • Sort
  • …

Algorithms in Top Sites: Electronic Commerce

Algorithms used in electronic commerce:
  • Recommendation
  • Association rule mining
  • Warehouse operation
  • Clustering
  • Classification
  • Statistical counting
  • Vector calculation
  • …

Main Algorithms in Data Centers

[Diagram: "Data center algorithms" at the center, surrounded by Segmentation, Basic operation, Warehouse operation, Classification, Cluster, Feature reduction, Recommendation, Vector calculation, Association rule mining, Graph mining]

Where Exactly Are Those Algorithms Used in Data Centers?

Here, let's investigate the most used applications in data centers:
  • The ubiquitous search engine
  • Frequently used recommendation sub-systems

Main Algorithms in Common Search Engines (Nutch)

[Diagram. Labels as extracted: Sort, Word, Grep, Merge Sort, Segmentation, Classification, BFS, Word Count, Vector calculate, Scoring & Sort, DecisionTree, Segmentation, PageRank]

Algorithms in Search Engine

[Diagram. Labels as extracted: graph mining, grep & segmentation, pagerank, word count, sort, vector calculation]

Representative Algorithms in Search Engine

  Algorithm            Role in the search engine
  Graph mining         Crawling web pages
  Grep                 Extracting content from HTML
  Segmentation         Word segmentation
  PageRank             Computing the page rank value
  Word counting        Word frequency count
  Vector calculation   Document matching
  Sort                 Document sorting

Algorithms in Recommendation Sub-systems

[Figure not recovered]

Representative Algorithms in Recommendation Sub-systems

  Algorithm                        Role in the recommendation sub-systems
  Classification                   Classifying web pages/user behavior
  Frequent pattern growth          User log mining
  Hidden Markov model              Information extraction
  Clustering/similarity analysis   Clustering web pages/user behavior
  Collaborative filtering          Recommendation
  Feature reduction                Text representation/user behavior representation
  Graph mining                     Web link analysis

Overview of DCBench

  Category                  Workload                             Programming model   Language   Source
  Basic operation           Sort                                 MapReduce           Java       Hadoop
                            Wordcount                            MapReduce           Java       Hadoop
                            Grep                                 MapReduce           Java       Hadoop
  Classification            Naïve Bayes                          MapReduce           Java       Mahout
                            Support Vector Machine               MapReduce           Java       Implemented by ourselves
  Cluster                   K-means                              MapReduce           Java       Mahout
                                                                 MPI                 C++        IBM PML
                            Fuzzy k-means                        MapReduce           Java       Mahout
                                                                 MPI                 C++        IBM PML
  Recommendation            Item-based Collaborative Filtering   MapReduce           Java       Mahout
  Association rule mining   Frequent pattern growth              MapReduce           Java       Mahout
  Segmentation              Hidden
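
The DCBench overview lists Wordcount as a basic operation written against the MapReduce programming model. As a rough illustration of that model only, here is a minimal single-process sketch in plain Python. It is not the Hadoop implementation that DCBench uses; the function names (map_phase, shuffle, reduce_phase, word_count) and the sample input are assumptions made for this example.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce step: sum the values of each key to get word frequencies."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(documents):
    """Run map, shuffle, and reduce over an iterable of text documents."""
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    return reduce_phase(shuffle(mapped))


if __name__ == "__main__":
    # Hypothetical input; the slides do not specify a data set.
    docs = ["big data benchmark", "data center benchmark suite"]
    print(word_count(docs))
```

In a real MapReduce framework the map and reduce steps run in parallel across nodes and the shuffle moves data over the network, which is why workloads of this kind are useful for studying scale-out behavior.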
