“Freedom” Koan-Sin Tan [email protected] OSDC.tw, Taipei, Apr 11th, 2014

Understanding Android Benchmarks

disclaimers
• many of the materials used in this slide deck are from the Internet and textbooks, e.g., many of the following materials are from “Computer Architecture: A Quantitative Approach,” 1st ~ 5th ed.
• opinions expressed here are my own and don’t reflect my employer’s views

who am i
• did some networking and security research before
• working for an SoC company, recently on
  • big.LITTLE scheduling and related stuff
  • parallel construct evaluation
• run benchmarks from time to time
  • for improving performance of our products, and
  • to know how our colleagues are progressing

• Focusing on CPU and memory parts of benchmarks
• let’s ignore graphics (2D, 3D), storage I/O, etc.

Blackbox!
• do a Google image search for “benchmark” and you will find that many of the results are Android-related benchmarks
• Similar to the recent Cross-Strait Trade in Services Agreement (TiSA), most benchmarks on the Android platform are kinda blackbox

Is Apple A7 good?
• When Apple released the new iPhone 5s, many technical blogs showed benchmark results in their reviews
• commonly used ones:
  • GeekBench
  • JavaScript benchmarks
  • Some graphics benchmarks
• Why? Are they the right ones? etc.
e.g., http://www.anandtech.com/show/7335/the-iphone-5s-review

open blackbox

Android Benchmarks

http://www.anandtech.com/show/7384/state-of-cheating-in-android-benchmarks
No, not improvement in this way

Assuming there is no cheating, what can we do?

Outline
• Performance benchmark review
• Some Android benchmarks
• What we did and what still can be done
• Future

To quote what Prof. Raj Jain quoted
• Benchmark v. trans. To subject (a system) to a series of tests in order to obtain prearranged results not available on competitive systems
From: “The Devil’s DP Dictionary” S.
Kelly-Bootle

Why benchmarking
• We did something good, let’s check if we did it right
• comparing with our own previous results to see if we broke anything
• We want to know how good our colleagues in other places are

What to report?
• Usually, what we mean by “benchmarking” is to measure performance
• What to report?
• intuitive answer: how many things we can do in a certain period of time
• yes, time. E.g., MIPS, MFLOPS, MiB/s, bps

MIPS and MFLOPS
• MIPS (Million Instructions per Second), MFLOPS (Million Floating-Point Operations per Second)
• Not all instructions are created equal
  – CISC machine instructions usually accomplish a lot more than those of RISC machines; comparing the instructions of a CISC machine and a RISC machine is similar to comparing Latin and Greek

MIPS and what’s wrong with them
• MIPS is instruction set dependent, making it difficult to compare MIPS of computers with different ISAs
• MIPS varies between programs on the same computer; and most importantly,
• MIPS can vary inversely to performance
  – w/ hardware FP, generally, MIPS is smaller

MFLOPS and what’s wrong with them
• Applies only to programs with floating-point operations
• Operations instead of instructions, but still
  – floating-point instructions are different on machines with different ISAs
  – Fast and slow floating-point operations
• Possible solution: weight and source code level count
  – ADD, SUB, COMPARE: 1
  – DIVIDE, SQRT: 2
  – EXP, SIN: 4

• The best choice of benchmarks to measure performance is real applications

Problematic benchmarks
• Kernel: small, key pieces of real applications, e.g., linpack
• Toy programs: 100-line programs from beginning programming assignments, e.g., quicksort
• Synthetic benchmarks: fake programs invented to try to match the profile and behavior of real applications, e.g., Dhrystone

Why are they in disrepute?
• Small, fit in cache
• Obsolete instruction mix
• Uncontrolled source code
• Prone to compiler tricks
• Short runtimes on modern machines
• Single-number performance characterization with a single benchmark
• Difficult to reproduce results (short runtime and low-precision UNIX timer)

Dhrystone
• Source
  – http://homepages.cwi.nl/~steven/dry.c
• < 1000 LoC
  – Size of CA15 binary compiled with bionic
    • Instructions: ~ 14 KiB

   text    data     bss     dec
  13918     467   10266   24660

Whetstone
• Dhrystone is a pun on Whetstone
• Source code: http://www.netlib.org/benchmark/whetstone.c

  Test        MFLOPS     MOPS       ms
  N1 float    119.78               0.16
  N2 float    171.98               0.78
  N3 if                  154.25    0.67
  N4 fixpt               397.48    0.79
  N5 cos                  19.08    4.36
  N6 float     84.22               6.41
  N7 equal                86.84    2.13
  N8 exp                   5.95    6.26
  MWIPS       463.97              21.55

More on Synthetic benchmarks
• The best known examples of synthetic benchmarks are Whetstone and Dhrystone
• Problems:
  – Compiler and hardware optimizations can artificially inflate performance of these benchmarks but not of real programs
  – The other side of the coin is that because these benchmarks are not natural programs, they don’t reward optimizations of behaviors that occur in real programs
• Examples:
  – Optimizing compilers can discard 25% of the Dhrystone code; examples include loops that are only executed once, making the loop overhead instructions unnecessary
  – Most Whetstone floating-point loops execute small numbers of times or include calls inside the loop. These characteristics are different from many real programs
  – Some more discussion in the 1st edition of the textbook

LINPACK
• LINPACK: a floating point benchmark from the manual of the LINPACK library
• Source
  – http://www.netlib.org/benchmark/linpackc
  – http://www.netlib.org/benchmark/linpackc.new
• 883 LoC
  – Size of CA15 binary compiled with bionic
    • Instructions: ~ 13 KiB

   text    data     bss     dec
  12670     408       0   13086

CoreMark (1/2)
• CoreMark is a benchmark that aims to measure the performance of central processing units (CPU) used in embedded systems.
It was developed in 2009 by Shay Gal-On at EEMBC and is intended to become an industry standard, replacing the antiquated Dhrystone benchmark
• The code is written in C and contains implementations of the following algorithms:
  – linked list processing,
  – matrix (mathematics) manipulation (common matrix operations),
  – state machine (determine if an input stream contains valid numbers), and
  – CRC
• from Wikipedia

CoreMark (2/2)
• CoreMark vs. Dhrystone
  – Reporting rule
  – Use of library calls, e.g., malloc() is avoided
  – CRC to make sure data are correct
• However, CoreMark is a kernel + synthetic benchmark, still with quite a small footprint

  name               LoC
  core_list_join.c   496
  core_matrix.c      308
  core_state.c       277
  core_util.c        210

   text    data     bss     dec
  18632     456      20   19108

So?
• To overcome the danger of placing all eggs in one basket, collections of benchmark applications, called benchmark suites, are a popular measure of the performance of processors across a variety of applications
• Standard Performance Evaluation Corporation (SPEC)

Why CPU2000 in 2010s?
• Why ARM sticks with SPEC CPU2000 instead of CPU2006
  – 1999 q4 results, earliest available CPU2000 results (http://www.spec.org/cpu2000/results/res1999q4/)
    • CINT2000 base: 133 – 424
    • CFP2000 base: 126 – 514
  – 2005 Opteron 144, 1.8 GHz
    • 1,440 (CA15 at 1.9 GHz, as reported by nVidia, is 1,168)
  – CPU2006 requires much more DRAM; 1 GiB DRAM is not enough

  name            CA9    CA7   CA15   Krait
  SPECint 2000    356    320    537     326
  SPECfp 2000     298    236    567     350
  All normalized to 1.0 GHz

SPEC numbers from Quantitative Approach 5th Edition

How long does SPEC CPU2000 take?
• About 1 hr to compile
• Runtime: sum of base runtimes multiplied by 3
  – E.g., 1.7 GHz CA15, (2256 + 3229) x 3 = 16,455 s ~= 4.57 hr
  – For 1.0 GHz: 4.57 x 1.7 = 7.77 hr
  – For CA7, assuming it is twice as slow: 7.77 x 2 = 15.54 hr

  Benchmark          Reference Time   Base Runtime   Base Ratio
  164.gzip                     1400          215            652
  175.vpr                      1400          198            707
  176.gcc                      1100           94.8         1161
  181.mcf                      1800          266            677
  186.crafty                   1000          118            850
  197.parser                   1800          291            619
  252.eon                      1300           87.8         1480
  253.perlbmk                  1800          172           1045
  254.gap                      1100          107           1026
  255.vortex                   1900          211            899
  256.bzip2                    1500          203            740
  300.twolf                    3000          399            752
  SPECint_base2000                          2256            854

  Benchmark          Reference Time   Base Runtime   Base Ratio
  168.wupwise                  1600          162            991
  171.swim                     3100          389            797
  172.mgrid                    1800          339            532
  173.applu                    2100          241            870
  177.mesa                     1400          112           1254
  178.galgel                   2900          201           1444
  179.art                      2600          195           1332
  183.equake                   1300          157            828
  187.facerec                  1900          183           1036
  188.ammp                     2200          353            623
  189.lucas                    2000          134           1491
  191.fma3d                    2100          212            988
  200.sixtrack                 1100          241            456
  301.apsi                     2600          310            839
  SPECfp_base2000               435         3229            909.6

Figure 1.16 SPEC2006 programs and the evolution of the SPEC benchmarks over time, with integer programs above the line and floating-point programs below the line. Of the 12 SPEC2006 integer programs, 9 are written in C, and the rest in C++. For the floating-point programs, the split is 6 in Fortran, 4 in C++, 3 in C, and 4 in mixed C and Fortran. The figure shows all 70 of the programs in the 1989, 1992, 1995, 2000, and 2006 releases. The benchmark descriptions on the left are for SPEC2006 only and do not apply to earlier versions. Programs in the same row from different generations of SPEC are generally not related; for example, fpppp is not a CFD code like bwaves. Gcc is the senior citizen of the group. Only 3 integer programs and 3 floating-point programs survived three or more generations. Note that all the floating-point programs are new for SPEC2006.
Although a few are carried over from generation to generation, the version of the program changes and either the input or the size of the benchmark is often changed to increase its running time and to avoid perturbation in measurement or domination of the execution time by some factor other than CPU time.
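
The arithmetic on the “How long does SPEC CPU2000 take?” slide can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, the inputs are the totals and the 175.vpr row as they appear on the slide, and the scaling to 1.0 GHz and to CA7 simply follows the slide's own assumptions (runtime inversely proportional to clock, CA7 taking twice as long).

```python
# Sketch of the SPEC CPU2000 arithmetic used on the slide.
# Function names are hypothetical; the numbers are copied from the slide.

def spec_ratio(reference_s, runtime_s):
    """CPU2000-style ratio: reference time over measured time, scaled by 100."""
    return 100.0 * reference_s / runtime_s

def full_run_hours(int_runtime_s, fp_runtime_s, iterations=3):
    """Wall-clock estimate: every benchmark is run `iterations` times."""
    return (int_runtime_s + fp_runtime_s) * iterations / 3600.0

# One row of the integer table: 175.vpr, reference 1400 s, base runtime 198 s.
vpr_ratio = spec_ratio(1400, 198)

# Totals stated on the slide for a 1.7 GHz CA15.
hours_at_1_7ghz = full_run_hours(2256, 3229)

# Assumption from the slide: runtime scales inversely with clock frequency.
hours_at_1_0ghz = hours_at_1_7ghz * 1.7

# Assumption from the slide: CA7 takes twice as long as CA15.
hours_ca7 = hours_at_1_0ghz * 2

print(f"175.vpr ratio:   {vpr_ratio:.0f}")
print(f"CA15 @ 1.7 GHz:  {hours_at_1_7ghz:.2f} hr")
print(f"CA15 @ 1.0 GHz:  {hours_at_1_0ghz:.2f} hr")
print(f"CA7  @ 1.0 GHz:  {hours_ca7:.2f} hr")
```

Rounded to two decimals, the three estimates agree with the 4.57, 7.77, and 15.54 hr figures on the slide, and the 175.vpr ratio rounds to the 707 listed in the table.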
