Understanding Android Benchmarks
“freedom” koan-sin tan
OSDC.tw, Taipei, Apr 11th, 2014

1 disclaimers

• many of the materials used in this slide deck are from the Internet and textbooks, e.g., many of the following materials are from “Computer Architecture: A Quantitative Approach,” 1st ~ 5th ed
• opinions expressed here are my personal ones and don’t reflect my employer’s view

2 who am i

• did some networking and security research before
• working for a SoC company, recently on
  • big.LITTLE scheduling and related stuff
  • parallel construct evaluation
• run benchmarks from time to time
  • to improve the performance of our products, and
  • to know our colleagues' progress

3
• Focusing on the CPU and memory parts of benchmarks
• let’s ignore graphics (2d, 3d), storage I/O, etc.

4 Blackbox

• google image search “”, and you will find that many of the results are Android-related benchmarks
• Similar to the recent Cross-Strait Trade in Services Agreement (TiSA), most benchmarks on the Android platform are kind of a black box

5 Is Apple A7 good?

• When Apple released the new iPhone 5s, many technical blogs showed benchmark results in their reviews
• commonly used ones:
  • GeekBench
  • JavaScript benchmarks
  • Some graphics benchmarks
• Why? Are they the right ones? etc.

e.g., http://www.anandtech.com/show/7335/the-iphone-5s-review

6 open blackbox

7 Android Benchmarks

8 No, not improvement in this way

http://www.anandtech.com/show/7384/state-of-cheating-in-android-benchmarks

9 Assuming there is no cheating, what can we do?

10 Outline

• Performance benchmark review
• Some Android benchmarks
• What we did and what still can be done
• Future

11 To quote what Prof. Raj Jain quoted

• Benchmark v. trans. To subject (a system) to a series of tests in order to obtain prearranged results not available on competitive systems

From: “The Devil’s DP Dictionary” S. Kelly-Bootle

12 Why benchmarking

• We did something good, let's check if we did it right
  • comparing with our own previous results to see if we broke anything
• We want to know how good our colleagues in other places are

13 What to report?

• Usually, what we mean by “benchmarking” is measuring performance
• What to report?
  • intuitive answer: how many things we do in a certain period of time
  • yes, time. E.g., MIPS, MFLOPS, MiB/s, bps

14 MIPS and MFLOPS

• MIPS (Million Instructions per Second), MFLOPS (Million Floating-Point Operations per Second)
• All instructions are not created equal
  – CISC machine instructions usually accomplish a lot more than those of RISC machines; comparing the instructions of a CISC machine and a RISC machine is similar to comparing Latin and Greek

15 MIPS and what’s wrong with them

• MIPS is instruction set dependent, making it difficult to compare the MIPS of computers with different ISAs
• MIPS varies between programs on the same computer; and most importantly,
• MIPS can vary inversely to performance
  – w/ hardware FP, generally, MIPS is smaller
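The inverse relation follows directly from the textbook definition of MIPS. The instruction counts and times in the worked example are made-up numbers chosen for illustration, not measurements:

```latex
\text{MIPS} = \frac{\text{instruction count}}{\text{execution time} \times 10^{6}}

% Hypothetical program, software floating point (many simple integer instructions):
\text{MIPS}_{\text{soft FP}} = \frac{200 \times 10^{6}}{2.0\,\text{s} \times 10^{6}} = 100

% Same program, hardware floating point (fewer, more powerful instructions):
\text{MIPS}_{\text{hard FP}} = \frac{30 \times 10^{6}}{0.5\,\text{s} \times 10^{6}} = 60
```

In this example the hardware-FP run finishes four times sooner, yet its MIPS rating is lower.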

16 MFLOPS and what’s wrong with them

• Applied only to programs with floating-point operations
• Operations instead of instructions, but still
  – floating-point instructions are different on machines with different ISAs
  – Fast and slow floating-point operations
• Possible solution: weight and source code level count
  – ADD, SUB, COMPARE: 1
  – DIVIDE, SQRT: 2
  – EXP, SIN: 4
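The weighted, source-level count can be sketched in a few lines of C. The weights are the ones listed on the slide; the enum, the array, and the function name `normalized_mflops` are names made up for this illustration:

```c
#include <stddef.h>

/* Operation classes from the slide; the names are illustrative only. */
enum fp_op {
    FP_ADD, FP_SUB, FP_COMPARE,   /* weight 1 */
    FP_DIVIDE, FP_SQRT,           /* weight 2 */
    FP_EXP, FP_SIN,               /* weight 4 */
    FP_OP_KINDS
};

/* Weight of each operation class, as given on the slide. */
static const unsigned fp_weight[FP_OP_KINDS] = {
    [FP_ADD] = 1, [FP_SUB] = 1, [FP_COMPARE] = 1,
    [FP_DIVIDE] = 2, [FP_SQRT] = 2,
    [FP_EXP] = 4, [FP_SIN] = 4
};

/* counts[k] is the number of source-level operations of class k.
   Returns the weighted operation count per second, in millions. */
double normalized_mflops(const unsigned long counts[FP_OP_KINDS],
                         double seconds)
{
    double weighted = 0.0;
    for (size_t k = 0; k < FP_OP_KINDS; ++k)
        weighted += (double)counts[k] * fp_weight[k];
    return weighted / (seconds * 1e6);
}
```

Counting at the source level means a machine does not get extra credit for needing more instructions to perform the same divide or square root.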

17 • The best choice of benchmarks to measure performance is real applications

18 Problematic benchmarks

• Kernel: small, key pieces of real applications, e.g.,
• Toy programs: 100-line programs from beginning programming assignments, e.g., quicksort
• Synthetic benchmarks: fake programs invented to try to match the profile and behavior of real applications, e.g., Dhrystone

19 Why are they discredited?

• Small, fit in cache
• Obsolete instruction mix
• Uncontrolled source code
• Prone to compiler tricks
• Short runtimes on modern machines
• Single-number performance characterization with a single benchmark
• Difficult to reproduce results (short runtime and low-precision UNIX timer)
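One common way around the short-runtime and coarse-timer problem is to repeat the kernel until a minimum amount of time has passed and report the average. The sketch below uses the standard C `clock()` call; `toy_kernel` and `seconds_per_call` are hypothetical names, and the toy loop merely stands in for the code under test:

```c
#include <time.h>

/* Hypothetical stand-in for the code under test. The volatile sink
   keeps the compiler from deleting the loop. */
static volatile unsigned long sink;

static void toy_kernel(void)
{
    for (unsigned long i = 0; i < 1000; ++i)
        sink += i;
}

/* Call kernel() repeatedly until at least min_seconds of CPU time have
   passed, then return the average number of seconds per call. */
double seconds_per_call(void (*kernel)(void), double min_seconds)
{
    unsigned long calls = 0;
    double elapsed;
    clock_t start = clock();

    do {
        kernel();
        ++calls;
        elapsed = (double)(clock() - start) / CLOCKS_PER_SEC;
    } while (elapsed < min_seconds);

    return elapsed / (double)calls;
}
```

For example, `seconds_per_call(toy_kernel, 1.0)` spreads one timer reading over a full second of work, so the timer resolution contributes only a small relative error.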

20 Dhrystone

• Source
  – http://homepages.cwi.nl/~steven/dry.c
• < 1000 LoC
  – Size of CA15 binary compiled with bionic
• Instructions: ~ 14 KiB

text     data    bss      dec
13918    467     10266    24660

21 Whetstone

• Dhrystone is a pun on Whetstone
• Source code: http://www.netlib.org/benchmark/whetstone.c

Test        MFLOPS / MOPS    ms
N1 float    119.78           0.16
N2 float    171.98           0.78
N3 if       154.25           0.67
N4 fixpt    397.48           0.79
N5 cos      19.08            4.36
N6 float    84.22            6.41
N7 equal    86.84            2.13
N8 exp      5.95             6.26
MWIPS       463.97           21.55

22 More on Synthetic benchmarks

• The best known examples of synthetic benchmarks are Whetstone and Dhrystone
• Problems:
  – Compiler and hardware optimizations can artificially inflate performance of these benchmarks but not of real programs
  – The other side of the coin is that because these benchmarks are not natural programs, they don’t reward optimizations of behaviors that occur in real programs
• Examples:
  – Optimizing compilers can discard 25% of the Dhrystone code; examples include loops that are only executed once, making the loop overhead instructions unnecessary
  – Most Whetstone floating-point loops execute small numbers of times or include calls inside the loop. These characteristics are different from many real programs
  – Some more discussion in 1st edition of the textbook

23 LINPACK

• LINPACK: a floating point benchmark from the manual of the LINPACK library
• Source
  – http://www.netlib.org/benchmark/linpackc
  – http://www.netlib.org/benchmark/linpackc.new
• 883 LoC
  – Size of CA15 binary compiled with bionic
• Instructions: ~ 13 KiB

text     data    bss    dec
12670    408     0      13086

24

25 CoreMark (1/2)

• CoreMark is a benchmark that aims to measure the performance of central processing units (CPU) used in embedded systems. It was developed in 2009 by Shay Gal-On at EEMBC and is intended to become an industry standard, replacing the antiquated Dhrystone benchmark
• The code is written in C and contains implementations of the following algorithms:
  – Linked list processing,
  – Matrix (mathematics) manipulation (common matrix operations),
  – state machine (determine if an input stream contains valid numbers), and
  – CRC
• from wikipedia

26 CoreMark (2/2)

• CoreMark vs. Dhrystone
  – Reporting rule
  – Use of library calls, e.g., malloc(), is avoided
  – CRC to make sure data are correct
• However, CoreMark is a kernel + synthetic benchmark, still with quite a small footprint

name                LoC
core_list_join.c    496
core_matrix.c       308
core_state.c        277
core_util.c         210

text     data    bss    dec
18632    456     20     19108

27 So?

• To overcome the danger of placing all eggs in one basket, collections of benchmark applications, called benchmark suites, are a popular measure of the performance of processors across a variety of applications
• Standard Performance Evaluation Corporation (SPEC)

28 29 Why CPU2000 in 2010s?

• Why ARM sticks with SPEC CPU2000 instead of CPU2006
  – 1999 q4 results, earliest available CPU2000 results (http://www.spec.org/cpu2000/results/res1999q4/)
    • CINT2000 base: 133 – 424
    • CFP2000 base: 126 – 514
  – 2005 Opteron 144, 1.8 GHz
    • 1,440 (CA15 at 1.9 GHz, as reported by nVidia, is 1,168)
  – CPU2006 requires much more DRAM, 1 GiB DRAM is not enough

name            CA9    CA7    CA15    Krait
SPECint 2000    356    320    537     326
SPECfp 2000     298    236    567     350
All normalized to 1.0 GHz

30 SPEC numbers from Quantitative Approach 5th Edition

31 How long does SPEC CPU2000 take?

• About 1 hr to compile
• Runtime: sum of base runtime multiplied by 3
  – E.g., 1.7 GHz CA15, (2256+3229) x 3 = 16,455 s ~= 4.57 hr
  – For 1.0 GHz: 4.57 x 1.7 = 7.77 hr
  – For CA7 assuming twice slower: 7.77 * 2 = 15.54 hr

Benchmark           Reference Time    Base Runtime    Base Ratio
164.gzip            1400              215             652
175.vpr             1400              198             707
176.gcc             1100              94.8            1161
181.mcf             1800              266             677
186.crafty          1000              118             850
197.parser          1800              291             619
252.eon             1300              87.8            1480
253.perlbmk         1800              172             1045
254.gap             1100              107             1026
255.vortex          1900              211             899
256.bzip2           1500              203             740
300.twolf           3000              399             752
SPECint_base2000                      2256            854

Benchmark           Reference Time    Base Runtime    Base Ratio
168.wupwise         1600              162             991
171.swim            3100              389             797
172.mgrid           1800              339             532
173.applu           2100              241             870
177.mesa            1400              112             1254
178.galgel          2900              201             1444
179.art             2600              195             1332
183.equake          1300              157             828
187.facerec         1900              183             1036
188.ammp            2200              353             623
189.lucas           2000              134             1491
191.fma3d           2100              212             988
200.sixtrack        1100              241             456
301.apsi            2600              310             839

SPECfp_base2000 435 3229 909.6
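The arithmetic behind these tables is simple enough to script. A SPEC CPU2000 ratio is the reference time divided by the measured run time, scaled by 100, and the slide's wall-clock estimate is three passes over the integer and floating-point suites. The function names below are made up for this sketch:

```c
/* SPEC CPU2000 ratio: reference time over measured run time, times 100. */
double spec_ratio(double reference_s, double measured_s)
{
    return 100.0 * reference_s / measured_s;
}

/* Estimated duration of a full run in hours, following the slide's rule:
   (sum of integer runtimes + sum of floating-point runtimes) x 3. */
double full_run_hours(double int_sum_s, double fp_sum_s)
{
    return (int_sum_s + fp_sum_s) * 3.0 / 3600.0;
}
```

With the numbers above, `spec_ratio(1400, 215)` gives about 651 for 164.gzip (the table shows 652, presumably because the listed runtime is rounded), and `full_run_hours(2256, 3229)` gives about 4.57.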

32 Figure 1.16 SPEC2006 programs and the evolution of the SPEC benchmarks over time, with integer programs above the line and floating-point programs below the line. Of the 12 SPEC2006 integer programs, 9 are written in C, and the rest in C++. For the floating-point programs, the split is 6 in Fortran, 4 in C++, 3 in C, and 4 in mixed C and Fortran. The figure shows all 70 of the programs in the 1989, 1992, 1995, 2000, and 2006 releases. The benchmark descriptions on the left are for SPEC2006 only and do not apply to earlier versions. Programs in the same row from different generations of SPEC are generally not related; for example, fpppp is not a CFD code like bwaves. Gcc is the senior citizen of the group. Only 3 integer programs and 3 floating-point programs survived three or more generations. Note that all the floating-point programs are new for SPEC2006. Although a few are carried over from generation to generation, the version of the program changes and either the input or the size of the benchmark is often changed to increase its running time and to avoid perturbation in measurement or domination of the execution time by some factor other than CPU time.

33 EEMBC

• Embedded Microprocessor Benchmark Consortium (EEMBC): 41 kernels used to predict performance of different embedded applications:
  – Automotive/industrial
  – Consumer
  – Networking
  – Office automation
  – Telecommunication
• 3rd edition showed some EEMBC results, 4th edition changed its mind
• Unmodified performance and “full-fury” performance
• Kernel, reporting options
  – Not a good predictor of relative performance of different embedded computers

34 Report benchmark results

• Reproducible
  – Machine configuration (hardware, software (OS, compiler, etc.))
• Summarizing results
  – You should not add different numbers
    • Some use weighted average
  – Ratio, compare with a reference machine
• Geometric mean
  – The geometric mean of the ratios is the same as the ratio of the geometric means
  – The ratio of the geometric means is equal to the geometric mean of the performance ratios
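The two geometric-mean statements are one identity read in both directions. As a sketch, write $A_i$ and $B_i$ (symbols introduced here, not taken from the slides) for the performance ratios of machines A and B on benchmark $i$ out of $n$:

```latex
\sqrt[n]{\prod_{i=1}^{n} \frac{A_i}{B_i}}
  = \frac{\sqrt[n]{\prod_{i=1}^{n} A_i}}{\sqrt[n]{\prod_{i=1}^{n} B_i}}
```

Each $A_i$ and $B_i$ is a reference-machine time divided by a measured time, so the reference times cancel inside $A_i / B_i$. The comparison between A and B therefore does not depend on which reference machine was chosen, which is not true of an arithmetic mean of ratios.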

35 Geometric mean

36
• Fallacy: Benchmarks remain valid indefinitely
  – Ability to resist “benchmark engineering” or “benchmarketing”
  – gcc is the only survivor from SPEC89
    • Almost 70% of all programs from SPEC2000 or earlier were dropped from the next release

37 Other benchmarks

• Stream
  – To test memory bandwidth
  – It also tests floating-point performance
  – Options of floating-point (double, 8 bytes) array
    • copy, scale, add, triad
• lmbench
  – Micro benchmark to measure software/hardware overhead from software perspective
  – lmbench paper (1996), http://www.bitmover.com/lmbench/lmbench-usenix.pdf

name     kernel                  bytes/iter    FLOPS/iter
COPY     a(i) = b(i)             16            0
SCALE    a(i) = q*b(i)           16            1
SUM      a(i) = b(i) + c(i)      24            1
TRIAD    a(i) = b(i) + q*c(i)    24            2
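The four kernels in the table are short enough to write out. This is a minimal sketch with made-up function names, not the STREAM source itself; the real benchmark also sizes its arrays well beyond the caches and times each kernel over several repetitions:

```c
#include <stddef.h>

/* COPY: a(i) = b(i), 16 bytes and 0 FLOPs per iteration. */
void stream_copy(double *a, const double *b, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        a[i] = b[i];
}

/* SCALE: a(i) = q*b(i), 16 bytes and 1 FLOP per iteration. */
void stream_scale(double *a, const double *b, double q, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        a[i] = q * b[i];
}

/* SUM: a(i) = b(i) + c(i), 24 bytes and 1 FLOP per iteration. */
void stream_sum(double *a, const double *b, const double *c, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        a[i] = b[i] + c[i];
}

/* TRIAD: a(i) = b(i) + q*c(i), 24 bytes and 2 FLOPs per iteration. */
void stream_triad(double *a, const double *b, const double *c,
                  double q, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        a[i] = b[i] + q * c[i];
}

/* Bandwidth in MB/s (10^6 bytes per second) for n iterations that each
   move bytes_per_iter bytes in the given number of seconds. */
double stream_mbps(size_t n, unsigned bytes_per_iter, double seconds)
{
    return (double)n * bytes_per_iter / seconds / 1e6;
}
```

The bytes/iter column is what turns a measured time into a bandwidth figure: a TRIAD pass over one million elements in one second corresponds to `stream_mbps(1000000, 24, 1.0)`, that is, 24 MB/s.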

38 for (k=0; k

Stream 5.10

39 lmbench

• lmbench is a micro-benchmark suite designed to focus attention on the basic building blocks of many common system applications, such as databases, simulations, software development, and networking

40 Parallel? Let’s look at other SPEC benchmarks

• SPECapc for 3ds Max™ 2011, performance evaluation software for systems running Autodesk 3ds Max 2011.
• SPECapc for Lightwave 3D 9.6, performance evaluation software for systems running NewTek LightWave 3D v9.6 software.
• SPECjbb2005, evaluates the performance of server side by emulating a three-tier client/server system (with emphasis on the middle tier).
• SPECjEnterprise2010, a multi-tier benchmark for measuring the performance of Java 2 Enterprise Edition (J2EE) technology-based application servers.
• SPECjms2007, Java Message Service performance
• SPECjvm2008, measuring basic Java performance of a Java Runtime Environment on a wide variety of both client and server systems.
• SPECapc, performance of several 3D-intensive popular applications on a given system
• SPEC MPI2007, for evaluating performance of parallel systems using MPI (Message Passing Interface) applications.
• SPEC OMP2001 V3.2, for evaluating performance of parallel systems using OpenMP (http://www.openmp.org) applications.
• SPECpower_ssj2008, evaluates the energy efficiency of server systems.
• SPECsfs2008, file server throughput and response time supporting both NFS and CIFS protocol access
• SPECsip_Infrastructure2011, SIP server performance
• SPECviewperf 11, performance of an OpenGL 3D graphics system, tested with various rendering tasks from real applications
• SPECvirt_sc2010 ("SPECvirt"), evaluates the performance of datacenter servers used in virtualized server consolidation

41 PARSEC

• The Princeton Application Repository for Shared-Memory Computers (PARSEC) is a benchmark suite composed of multithreaded programs. The suite focuses on emerging workloads and was designed to be representative of next-generation shared-memory programs for chip-multiprocessors
• Didn’t really use it yet
• http://parsec.cs.princeton.edu/

Parallelization Model
Workload         Pthreads    OpenMP    Intel TBB
blackscholes     Yes         Yes       Yes
bodytrack        Yes         Yes       Yes
canneal          Yes         No        No
dedup            Yes         No        No
facesim          Yes         No        No
ferret           Yes         No        No
fluidanimate     Yes         No        Yes
freqmine         No          Yes       No
raytrace         Yes         No        No
streamcluster    Yes         No        Yes
swaptions        Yes         No        Yes
vips             Yes         No        No
x264             Yes         No        No

42 Is Dhrystone useful?

• Yes, if you know its limitations
• Don't market these benchmarks as if they meant real user-perceived performance

43 A7 Dhrystone (DMIPS/MHz)

             iPhone 5s    iPhone 5s 32-bit    CA15    CA7     Krait 400
DMIPS/MHz    7.47         5.70                2.71    1.67    2.46

44 A7 linpack MFLOPS (MFLOPS/GHz)

              iPhone 5s 32-bit    iPhone 5s    CA15    CA7    Krait 400
MFLOPS/GHz    722                 723          449     119    299

45 A7 CoreMark (CoreMark/MHz)

                iPhone 5s    iPhone 5s 32-bit    CA15    CA7     Krait 400
CoreMark/MHz    5.72         4.45                3.67    2.46    3.30

46 Different items

• Example, GeekBench 3 • Arithmetic mean with different weight? How? • Good properties of geometric mean

47 Source code

• So far, everything we talked about is software with source code available, either publicly/freely, e.g., Dhrystone, or for a small amount of $, e.g., SPEC CPU

48 • Benchmark scores/results usually depend on compiler, compiler flags, processors, and systems

49 Outline

• Performance benchmark review
• Some Android benchmarks
• What we did and what still can be done
• Future

50 Back to Android

• What kinds of benchmarks are available, or used to compare performance
  • Apps with native benchmarks: Antutu, GeekBench
  • Java apps, e.g., Quadrant
  • Hybrid: with both native and Java, e.g., AndEBench and CF-Bench
• We also use SPEC CPU2000 and other stuff internally

51 Ars Technica List

arrayOfPackageInfo[0] = new PackageInfo("com.aurorasoftworks.quadrant.ui.standard", false);
arrayOfPackageInfo[1] = new PackageInfo("com.aurorasoftworks.quadrant.ui.advanced", false);
arrayOfPackageInfo[2] = new PackageInfo("com.aurorasoftworks.quadrant.ui.professional", false);
arrayOfPackageInfo[3] = new PackageInfo("com.redlicense.benchmark.sqlite", false);
arrayOfPackageInfo[4] = new PackageInfo("com.antutu.ABenchMark", false);
arrayOfPackageInfo[5] = new PackageInfo("com.greenecomputing.linpack", false);
arrayOfPackageInfo[6] = new PackageInfo("com.greenecomputing.linpackpro", false);
arrayOfPackageInfo[7] = new PackageInfo("com.glbenchmark.glbenchmark27", false);
arrayOfPackageInfo[8] = new PackageInfo("com.glbenchmark.glbenchmark25", false);
arrayOfPackageInfo[9] = new PackageInfo("com.glbenchmark.glbenchmark21", false);
arrayOfPackageInfo[10] = new PackageInfo("ca.primatelabs.geekbench2", false);
arrayOfPackageInfo[11] = new PackageInfo("com..", false);
arrayOfPackageInfo[12] = new PackageInfo("com.flexycore.caffeinemark", false);
arrayOfPackageInfo[13] = new PackageInfo("eu.chainfire.cfbench", false);
arrayOfPackageInfo[14] = new PackageInfo("gr.androiddev.BenchmarkPi", false);
arrayOfPackageInfo[15] = new PackageInfo("com.smartbench.twelve", false);
arrayOfPackageInfo[16] = new PackageInfo("com.passmark.pt_mobile", false);
arrayOfPackageInfo[17] = new PackageInfo("se.nena.nenamark2", false);
arrayOfPackageInfo[18] = new PackageInfo("com.samsung.benchmarks", false);
arrayOfPackageInfo[19] = new PackageInfo("com.samsung.benchmarks:db", false);
arrayOfPackageInfo[20] = new PackageInfo("com.samsung.benchmarks:es1", false);
arrayOfPackageInfo[21] = new PackageInfo("com.samsung.benchmarks:es2", false);
arrayOfPackageInfo[22] = new PackageInfo("com.samsung.benchmarks:g2d", false);
arrayOfPackageInfo[23] = new PackageInfo("com.samsung.benchmarks:fs", false);
arrayOfPackageInfo[24] = new PackageInfo("com.samsung.benchmarks:ks", false);
arrayOfPackageInfo[25] = new PackageInfo("com.samsung.benchmarks:cpu

CPU and Memory related: Quadrant, Antutu, linpack, GeekBench, AndEBench (coremark), CaffeineMark, Pi, PassMark, Samsung’s benchmark

52 Antutu 3.x

• CPU: integer, floating point
• memory: RAM
• Graphics: 2D, 3D
• I/O: Database, SD read, SD write

• What are you benchmarking?
• What's your workload?
• How do you calculate scores?

53 What on earth are they doing?

• Actually no publicly available information
• But, with good enough background knowledge and proper tools (we’ll talk about these later), we can figure it out
• It turns out most of them are from BYTE's nbench (http://en.wikipedia.org/wiki/NBench)

54 AnTuTu 3.x CPU and Memory Tests

nbench item         Used by Antutu    Antutu part    Antutu percentage on progress bar    Order    nbench category
NUMERIC SORT        yes               Integer        27%                                  4        integer
STRING SORT         yes               RAM            1%                                   1        memory
BITFIELD            yes               RAM            1%                                   2        memory
FP EMULATION        no
FOURIER             yes               floating       47%                                  7        floating point
ASSIGNMENT          yes               RAM            8%                                   3        memory
IDEA                yes               Integer        27%                                  5        integer
HUFFMAN             yes               Integer        34%                                  6        integer
NEURAL NET          no
LU DECOMPOSITION    no

55 A closer look

▪ RAM
  – String sort:
    • string heap sort: StrHeapSort()
    • MoveMemory() → memmove()
  – Bit Field:
    • Bit field test: DoBitops()
  – Assignment:
    • Task Assignment test: DoAssignment()
▪ Integer
  – Numeric sort:
    • Numeric heap sort: NumHeapSort()
  – IDEA:
    • IDEA encryption and decryption: cipher_idea()
  – Huffman:
    • Huffman encoding
▪ Floating point:
  – Fourier:
    • Fourier transform: pow(), sin(), cos()

56 String Sort in NBench

• Sorts an array of strings of arbitrary length
• Test memory movement performance
• Non-sequential performance of cache, with added burden that moves are byte-wide and can occur on odd address boundaries

for(i=top; i>0; --i)
{
    strsift(optrarray,strarray,numstrings,0,i);

    /* temp = string[0] */
    tlen=*strarray;
    MoveMemory((farvoid *)&temp[0], /* Perform exchange */
        (farvoid *)strarray,
        (unsigned long)(tlen+1));

    /* string[0]=string[i] */
    tlen=*(strarray+*(optrarray+i));
    stradjust(optrarray,strarray,numstrings,0,tlen);
    MoveMemory((farvoid *)strarray,
        (farvoid *)(strarray+*(optrarray+i)),
        (unsigned long)(tlen+1));

    /* string[i]=temp */
    tlen=temp[0];
    stradjust(optrarray,strarray,numstrings,i,tlen);
    MoveMemory((farvoid *)(strarray+*(optrarray+i)),
        (farvoid *)&temp[0],
        (unsigned long)(tlen+1));

}

57 Bit field in NBench

• Executes 3 bit manipulation functions
• Exercises "bit twiddling" performance. Travels through memory bit-by-bit in a sequential fashion; different from sorts in that data is merely altered in place
• Operations:
  • Set: OR 1
  • Clear: AND 0
  • Toggle: XOR
  • Set, clear: ToggleBitRun()

static void ToggleBitRun(farulong *bitmap, /* Bitmap */
        ulong bit_addr, /* Address of bits to set */
        ulong nbits, /* # of bits to set/clr */
        uint val) /* 1 or 0 */
{
unsigned long bindex; /* Index into array */
unsigned long bitnumb; /* Bit number */

while(nbits--)
{
#ifdef LONG64
    bindex=bit_addr>>6; /* Index is number /64 */
    bitnumb=bit_addr % 64; /* Bit number in word */
#else
    bindex=bit_addr>>5; /* Index is number /32 */
    bitnumb=bit_addr % 32; /* bit number in word */
#endif
    if(val)
        bitmap[bindex]|=(1L<

58 Assignment in NBench

• The test moves through large integer arrays in both row-wise and column-wise fashion. Cache/memory with good sequential

/*
** Step through rows. For each one that is not currently
** assigned, see if the row has only one zero in it. If so,
** mark that as an assigned row/col. Eliminate other zeros
** in the same column.
*/
for(i=0;i

59 Numeric Sort in NBench

• Sorts an array of long integers with heap sort
• Generic integer performance. Should exercise non-sequential performance of cache (or memory if cache is less than 8K). Moves 32-bit longs at a time, so 16-bit processors will be at a disadvantage

static void NumHeapSort(farlong *array,
        ulong bottom, /* Lower bound */
        ulong top) /* Upper bound */
{
ulong temp; /* Used to exchange elements */
ulong i; /* Loop index */

/*
** First, build a heap in the array
*/
for(i=(top/2L); i>0; --i)
    NumSift(array,i,top);

/*
** Repeatedly extract maximum from heap and place it at the
** end of the array. When we get done, we'll have a sorted
** array.
*/
for(i=top; i>0; --i)
{
    NumSift(array,bottom,i);
    temp=*array; /* Perform exchange */
    *array=*(array+i);
    *(array+i)=temp;
}
return;
}

60 IDEA Encryption in NBench

• IDEA: a new block cipher when nbench was in development
• Moves through data sequentially in 16-bit chunks

static void cipher_idea(u16 in[4],
        u16 out[4],
        register IDEAkey Z)
{
register u16 x1, x2, x3, x4, t1, t2;
/* register u16 t16;
register u16 t32; */
int r=ROUNDS;

x1=*in++;
x2=*in++;
x3=*in++;
x4=*in;

do {
    MUL(x1,*Z++);
    x2+=*Z++;
    x3+=*Z++;
    MUL(x4,*Z++);

    t2=x1^x3;
    MUL(t2,*Z++);
    t1=t2+(x2^x4);
    MUL(t1,*Z++);
    t2=t1+t2;

    x1^=t1;
    x4^=t2;

    t2^=x2;
    x2=x3^t1;
    x3=t2;
} while(--r);
MUL(x1,*Z++);
*out++=x1;
*out++=x3+*Z++;
*out++=x2+*Z++;
MUL(x4,*Z);
*out=x4;
return;
}

61 Huffman in NBench

• Everybody knows Huffman code, right?
• A combination of byte operations, bit twiddling, and overall integer manipulation

.....
/*
** Huffman tree built...compress the plaintext
*/
bitoffset=0L; /* Initialize bit offset */
for(i=0;i

62 Fourier in NBench

• No, not FFT
• Good measure of transcendental and trigonometric performance of FPU. Little array activity, so this test should not be dependent on cache or memory architecture

static double thefunction(double x, /* Independent variable */
        double omegan, /* Omega * term */
        int select) /* Choose term */
{
/*
** Use select to pick which function we call.
*/
switch(select)
{
    case 0: return(pow(x+(double)1.0,x));
    case 1: return(pow(x+(double)1.0,x) * cos(omegan * x));
    case 2: return(pow(x+(double)1.0,x) * sin(omegan * x));
}

63 Neural Net in NBench

• A small but functional back-propagation network simulator
• Small-array floating-point test heavily dependent on the exponential function; less dependent on overall FPU performance

64 LU Decomposition in NBench

• LU Decomposition
  • Yes, the LU decomposition you learned in linear algebra
• A floating-point test that moves through arrays in both row-wise and column-wise fashion. Exercises only fundamental math operations (+, -, *, /)

65 GeekBench

• A cross-platform one
• The only publicly available one we could use to compare Android, iOS, and other platforms
• Quite clearly described test items
  • http://support.primatelabs.com/kb/geekbench/geekbench-3-benchmarks
• Explaining how to interpret results
  • http://support.primatelabs.com/kb/geekbench/interpreting-geekbench-3-scores
• Source code available if you pay

66 Vellamo

• HTML5
• Metal: Dhrystone, Linpack, Branch-K, Stream 5.9, RamJam, Storage
  • some are well-known; some are written by Quic?
• Anyway, all of them are described at http://www.quicinc.com/vellamo/test-descriptions/

67 CFBench

• Used by some people, because
  • it tests both Java and native versions
  • its author is quite active in the xda developer forum
• Some problems
  • no good description of tests
  • some code is wrong, e.g.,
    • its Native Memory Read test is not testing memory read, because the malloc()ed array is not initialized
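A read test is only meaningful once every page of the buffer has actually been written. A likely explanation, offered here as an assumption rather than something the slides state, is that on Linux a large malloc() returns fresh anonymous pages, and reads from never-written pages can be served from a shared zero page instead of distinct physical memory. The sketch below shows the fix; the function names are hypothetical and this is not CF-Bench's code:

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Allocate a buffer and write every byte, so that each page is backed
   by real memory before any read test runs. Returns NULL on failure. */
uint32_t *alloc_initialized(size_t words)
{
    uint32_t *buf = malloc(words * sizeof *buf);
    if (buf != NULL)
        memset(buf, 0x01, words * sizeof *buf);
    return buf;
}

/* One read pass over the buffer. The sum is returned so that the
   compiler cannot discard the loads. */
uint32_t read_sum(const uint32_t *buf, size_t words)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < words; ++i)
        sum += buf[i];
    return sum;
}
```

A read benchmark would time `read_sum()` on a buffer from `alloc_initialized()`, and would choose a buffer much larger than the last-level cache if DRAM rather than cache bandwidth is the target.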

68 Outline

• Performance benchmark review • Some Android benchmarks • What we did and what still can be done • Future

69 How do we improve benchmark performance

70
• In the good old days, we had source code; we compiled and ran benchmark programs
• In the current Android ecosystem
  • Usually we don’t have source
  • Profiling: oprofile, perf, DS-5
    • profiling sometimes doesn’t report the real bottleneck function, e.g., static functions usually are inlined and don’t have symbols in shipped binaries
  • binutils: nm, readelf, objdump, gdb
• Improving libraries, e.g., libc and libm, and runtime systems, e.g., the JIT of Dalvik, used by those benchmarks

71 Antutu 3.x

• memmove() in bionic --> bcopy() in C
  • rewrite with NEON assembly code
• pow(), sin(), cos() in C
  • rewrite them with assembly

72 bcopy() in bionic

• MoveMemory() in nbench -> memmove() in bionic -> bcopy() in bionic
• memcpy() assembly in bionic and there are processor specific ones (CA9, CA15, Krait). NEON (vector load/store) helps
• not for bcopy()

in bionic/libc/bionic/memmove.c

void *memmove(void *dst, const void *src, size_t n)
{
  const char *p = src;
  char *q = dst;
  /* We can use the optimized memcpy if the source and destination
   * don't overlap.
   */
  if (__builtin_expect(((q < p) && ((size_t)(p - q) >= n))
              || ((p < q) && ((size_t)(q - p) >= n)), 1)) {
    return memcpy(dst, src, n);
  } else {
    bcopy(src, dst, n);
    return dst;
  }
}

in bionic/libc/string/bcopy.c

/*
 * Copy a block of memory, handling overlap.
 * This is the routine that actually implements
 * (the portable versions of) bcopy, memcpy, and memmove.
 */
#ifdef MEMCOPY
void *
memcpy(void *dst0, const void *src0, size_t length)
#else
#ifdef MEMMOVE
void *
memmove(void *dst0, const void *src0, size_t length)
#else
void
bcopy(const void *src0, void *dst0, size_t length)
#endif
#endif
{
.....

73 Antutu 3.x

• For people with source code
  • Selection of toolchain and compiler options may cause a huge difference, e.g., bit field
  • Some version of the Antutu 3.x binary was compiled with Intel's compiler; bit-by-bit operations were turned into word-wide (32-bit) operations, and the speedup is about 70x
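To see why word-wide operations help so much, compare setting a run of bits one at a time with setting whole 32-bit words through a mask. This is a hand-written sketch of the idea with made-up function names, not the compiler's actual output and not nbench's code:

```c
#include <stddef.h>
#include <stdint.h>

/* Set nbits bits starting at bit_addr, one bit per loop iteration,
   in the style of a bit-by-bit bit field test. */
void set_run_bitwise(uint32_t *bitmap, size_t bit_addr, size_t nbits)
{
    while (nbits--) {
        bitmap[bit_addr >> 5] |= (uint32_t)1 << (bit_addr & 31);
        ++bit_addr;
    }
}

/* Same result, but each loop iteration covers up to 32 bits at once. */
void set_run_wordwise(uint32_t *bitmap, size_t bit_addr, size_t nbits)
{
    while (nbits > 0) {
        size_t offset = bit_addr & 31;   /* first bit inside this word */
        size_t chunk = 32 - offset;      /* bits left in this word */
        uint32_t mask;

        if (chunk > nbits)
            chunk = nbits;
        /* A full word needs a special case: shifting by 32 is undefined. */
        mask = (chunk == 32) ? UINT32_MAX
                             : ((((uint32_t)1 << chunk) - 1) << offset);
        bitmap[bit_addr >> 5] |= mask;
        bit_addr += chunk;
        nbits -= chunk;
    }
}
```

For a long run the second version executes roughly one thirty-second of the loop iterations, which shows how a compiler that recognizes the pattern can change a score without the hardware being any faster.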

74 Stream copy is usually turned into memcpy()

75 remote gdb

1. get /system/bin/app_process and /system/bin/linker of the target system and the necessary shared libraries, e.g., /data/data/eu.chainfire.cfbench/lib/libCFBench.so
   • adb pull /system/bin/app_process
   • adb pull /system/bin/linker lib/armeabi-v7a/
   • adb pull /data/data/eu.chainfire.cfbench/lib/libCFBench.so lib/armeabi-v7a/
2. arm--gnueabi-gdb ./app_process
3. on the target device, attach gdbserver to the running process you want to debug
   • ./gdbserver --attach :5039 3484
4. set shared library search path
   • (gdb) set solib-search-path /Users/freedom/tmp/cfbench/lib/armeabi-v7a
5. ‘adb forward tcp:5039 tcp:5039’ and set remote target
   • (gdb) target remote :5039
6. you can set breakpoints, print backtrace, disassemble, etc.

76
• (gdb) b Java_eu_chainfire_cfbench_BenchNative_benchMemReadAligned
• (gdb) disassemble

Dump of assembler code for function Java_eu_chainfire_cfbench_BenchNative_benchMemReadAligned:
   0x74b65848 <+0>: stmdb sp!, {r4, r5, r6, r7, r8, r9, r10, lr}
=> 0x74b6584c <+4>: bl 0x74b654ac
   0x74b65850 <+8>: mov.w r0, #1048576 ; 0x100000
   0x74b65854 <+12>: blx 0x74b65358
   0x74b65858 <+16>: movs r6, #0
   0x74b6585a <+18>: movw r9, #9999 ; 0x270f
   0x74b6585e <+22>: mov r8, r0
   0x74b65860 <+24>: bl 0x74b6547c
   0x74b65864 <+28>: add.w r5, r8, #1048576 ; 0x100000
   0x74b65868 <+32>: mov r10, r0
   0x74b6586a <+34>: mov r3, r8
   0x74b6586c <+36>: ldr.w r2, [r3], #4
   0x74b65870 <+40>: cmp r3, r5
   0x74b65872 <+42>: add r4, r2
   0x74b65874 <+44>: bne.n 0x74b6586c
   0x74b65876 <+46>: bl 0x74b6547c
   0x74b6587a <+50>: adds r6, #1
   0x74b6587c <+52>: rsb r7, r10, r0
   0x74b65880 <+56>: cmp r7, r9

77 Quadrant

• Written in Java
• CPU: Not really testing CPU
• Memory: profiling shows that memcpy() is heavily used
• What can we do
  • optimize the JIT part of the DVM

78 What other possible ways?

• binary translation during
  • installation time
  • run time

79 Wrap-up

• Popular CPU and memory benchmarks on Android mostly don’t reflect real CPU performance
• We know CPU performance != system performance != user-perceived performance
• There is always room for improvement

80 So?

81 Recent progress

• EEMBC’s AndEBench 2.0 is under development (http://www.eembc.org/press/pressrelease/130128.html)
• Qualcomm asked BDTi to develop a new benchmark (http://www.qualcomm.com/media/blog/2013/08/16/mobile-benchmarking-turning-corner-user-experience)
• Samsung with other vendors launched the MobileBench Consortium last year
• Antutu is still growing

82 Thanks!

83 Advertisement

• MediaTek joined linaro.org last month
• linaro.org is an NPO working on open source Linux/Android related stuff for ARM-based SoCs
• So MTK is getting more open recently
• And, it’s looking for open source engineers
• Talk to the guys at the MTK booth or me
• There are more non-open source jobs

84 backup

85 Some References to Understand Performance Benchmark

• Raj Jain, “The Art of Computer Systems Performance Analysis: Techniques for Experimental Design, Measurement, Simulation, and Modeling”, Wiley, 1991
• Quantitative Approach
• A good SPEC introduction article, http://mrob.com/pub/comp/benchmarks/spec.html
• Kaivalya M. Dixit, “Overview of the SPEC Benchmarks,” http://people.cs.uchicago.edu/~chliu/doc/benchmark/chapter9.pdf

86
Basic system parameters
------
Host OS Description Mhz tlb cache mem scal
pages line par load
bytes
------
localhost Linux 3.4.5-g armv7l-linux-gnu 1696 7 64 4.4700 1

Processor, Processes - times in microseconds - smaller is better
------
Host OS Mhz null null open slct sig sig fork exec sh
call I/O stat clos TCP inst hndl proc proc proc
------
localhost Linux 3.4.5-g 1696 0.49 0.67 2.54 5.95 8.52 0.67 5.05 876. 1668 4654

Basic integer operations - times in nanoseconds - smaller is better
------
Host OS intgr intgr intgr intgr intgr
bit add mul div mod
------
localhost Linux 3.4.5-g 1.0700 0.1100 3.4000 90.5 14.8

Basic float operations - times in nanoseconds - smaller is better
------

87
Context switching - times in microseconds - smaller is better
------
Host OS 2p/0K 2p/16K 2p/64K 8p/16K 8p/64K 16p/16K 16p/64K
ctxsw ctxsw ctxsw ctxsw ctxsw ctxsw ctxsw
------
localhost Linux 3.4.5-g 8.9700 4.9000 6.1400 12.3 7.68000 57.6

*Local* Communication latencies in microseconds - smaller is better
------
Host OS 2p/0K Pipe AF UDP RPC/ TCP RPC/ TCP
ctxsw UNIX UDP TCP conn
------
localhost Linux 3.4.5-g 8.970 17.6 23.9 47.5 71.3 357.

File & VM system latencies in microseconds - smaller is better
------
Host OS 0K File 10K File Mmap Prot Page 100fd
Create Delete Create Delete Latency Fault Fault selct
------
localhost Linux 3.4.5-g 700.0 1.259 2.55270 3.048

*Local* Communication bandwidths in MB/s - bigger is better
------
Host OS Pipe AF TCP File Mmap Bcopy Bcopy Mem Mem

88 PARSEC content

• Blackscholes: This application is an Intel RMS benchmark. It calculates the prices for a portfolio of European options analytically with the Black-Scholes partial differential equation (PDE). There is no closed-form expression for the Black-Scholes equation and as such it must be computed numerically.
• Bodytrack: This computer vision application is an Intel RMS workload which tracks a human body with multiple cameras through an image sequence. This benchmark was included due to the increasing significance of computer vision algorithms in areas such as video surveillance, character animation and computer interfaces.
• Canneal: This kernel was developed by Princeton University. It uses cache-aware simulated annealing (SA) to minimize the routing cost of a chip design. Canneal uses fine-grained parallelism with a lock-free algorithm and a very aggressive synchronization strategy that is based on data race recovery instead of avoidance.
• Dedup: This kernel was developed by Princeton University. It compresses a data stream with a combination of global and local compression that is called 'deduplication'. The kernel uses a pipelined programming model to mimic real-world implementations. The reason for the inclusion of this kernel is that deduplication has become a mainstream method for new-generation backup storage systems.
• Facesim: This Intel RMS application was originally developed by Stanford University. It computes a visually realistic animation of the modeled face by simulating the underlying physics. The workload was included in the benchmark suite because an increasing number of animations employ physical simulation to create more realistic effects.
• Ferret: This application is based on the Ferret toolkit which is used for content-based similarity search. It was developed by Princeton University. The reason for the inclusion in the benchmark suite is that it represents emerging next-generation search engines for non-text document data types. In the benchmark, we have configured the Ferret toolkit for image similarity search. Ferret is parallelized using the pipeline model.

89 PARSEC content

• Fluidanimate: This Intel RMS application uses an extension of the Smoothed Particle Hydrodynamics (SPH) method to simulate an incompressible fluid for interactive animation purposes. It was included in the PARSEC benchmark suite because of the increasing significance of physics simulations for animations.
• Freqmine: This application employs an array-based version of the FP-growth (Frequent Pattern-growth) method for Frequent Itemset Mining (FIMI). It is an Intel RMS benchmark which was originally developed by Concordia University. Freqmine was included in the PARSEC benchmark suite because of the increasing use of data mining techniques.
• Raytrace: The Intel RMS application uses a version of the raytracing method that would typically be employed for real-time animations such as computer games. It is optimized for speed rather than realism. The computational complexity of the algorithm depends on the resolution of the output image and the scene.
• Streamcluster: This RMS kernel was developed by Princeton University and solves the online clustering problem. Streamcluster was included in the PARSEC benchmark suite because of the importance of data mining algorithms and the prevalence of problems with streaming characteristics.
• Swaptions: The application is an Intel RMS workload which uses the Heath-Jarrow-Morton (HJM) framework to price a portfolio of swaptions. Swaptions employs Monte Carlo (MC) simulation to compute the prices.
• Vips: This application is based on the VASARI Image Processing System (VIPS) which was originally developed through several projects funded by European Union (EU) grants. The benchmark version is derived from a print on demand service that is offered at the National Gallery of London, which is also the current maintainer of the system. The benchmark includes fundamental image operations such as an affine transformation and a convolution.
• X264

90