for Program Analysis and Generation

Robert van Engelen

10/1/09 Research Seminar

First, a little story…

Step 0: School

We learned to program in school…

Step 1: College

… then told to forget what we learned and start over…

// Assignment 1: cupsof.java
// Submitted by: * bucks
// Shows good Java coding style and commenting

import java.lang.*;

public class cupsof {
  public static void main(String[] arg) {
    // print 500 times something to cheer about
    for (int count = 0; count < 500; count++)
      System.out.println(count + " cups of java on the wall");
  }
}

Step 2: Graduation

… all the while doing our best to impress professors…

Step 3: Business

… to find our dream job!

The Experts Told Us…

Carefully design your programs! “Controlling complexity is the essence of computer programming.” (Brian Kernighan)

Don’t hack! “If debugging is the process of removing bugs, then programming must be the process of putting them in.” (Edsger W. Dijkstra)

But don’t feel too bad about mistakes? “The best thing about a boolean is even if you are wrong, you are only off by a bit.” (Anonymous)

Other programming languages may offer salvation, but we don’t use them: “There are only two kinds of programming languages: those people always bitch about and those nobody uses.” (Bjarne Stroustrup)

Programming = ?

= Solving problems?
– Specify the problem and find a (software) tool to solve it…
– … if only we had tools that powerful to solve anything!
  • For certain domains: Excel, R, Mathematica, Maple, MATLAB, TurboTax, Garmin/TomTom, Blackboard, etc.
– … otherwise, if we can’t use a tool or we design an…

Programming = ?

= Writing code?
– No one really writes code…
– … we write abstract specifications that we usually call programs
– … only the compiler/interpreter writes code by translating our program into machine instructions
– The compiler complains when your specification has syntactic errors or static semantic errors

Programming = ?

= Documenting, testing, and debugging?
– Run‐time errors and logical errors are not caught by the compiler
– We must specify unit and regression tests
– Unless we use Dilbert’s agile programming ;‐)

Programming = Specifying?

1. Specify an algorithm (design)

2. Specify a program (implementation)

3. Specify tests (testing)

Lather, rinse, repeat…

Questions

• How can well‐designed programming languages prevent programming mistakes?

• How can we use static analysis and formal methods to analyze programs for errors?

• How can we formally specify an algorithm and generate efficient code for it?

Some Comments on Design

• Principle of least surprise (POLS)
• Uniform Access Principle (UAP) for OOP
• No pointer arithmetic, references are OK
• Static typing
• Orthogonality of programming constructs
• Exception handling
• Support for assertions and invariants (as in Eiffel), as sketched below
• And also … (this depends on the target apps)
  – Referential transparency (functional programming)
  – Immutable objects (functional programming)
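To make the assertions-and-invariants point concrete, here is a minimal C sketch (the Counter type and function names are hypothetical, chosen to mirror the Eiffel COUNTER example later in the talk) of emulating require/ensure/invariant clauses with assert:

#include <assert.h>
#include <stdio.h>

/* Hypothetical counter with an Eiffel-style contract:
   invariant: value >= 0 */
typedef struct { int value; } Counter;

static void check_invariant(const Counter *c) { assert(c->value >= 0); }

static void counter_decrement(Counter *c)
{
    assert(c->value > 0);               /* require: item > 0            */
    int old_value = c->value;
    c->value = c->value - 1;
    assert(c->value == old_value - 1);  /* ensure: item = old item - 1  */
    check_invariant(c);                 /* invariant: item >= 0         */
}

int main(void)
{
    Counter c = { 1 };
    counter_decrement(&c);              /* ok */
    /* a second counter_decrement(&c) would abort: require fails */
    printf("%d\n", c.value);
    return 0;
}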

There are Lots of Tools out There to Check Your Code

Most tools target C, C++, C#, and/or Java

Static analysis of source code (bug sniffing):
– Lint and splint: GNU tools for bug‐sniffing C
– PC‐lint by Gimpel for bug‐sniffing C++

Model checking (steps through every possible execution path):
– Klocwork: C/C++, C#, and Java analysis
– Coverity: C/C++ analysis

Dynamic analysis (detecting memory leaks and/or race conditions):
– Valgrind, Dmalloc, Insure++, TotalView
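For example, a dynamic analyzer such as Valgrind reports the allocation below as definitely lost at exit (a deliberately leaky sketch; run it under valgrind --leak-check=full):

#include <stdlib.h>
#include <string.h>

int main(void)
{
    char *p = malloc(100);        /* allocated here ...                  */
    if (p) strcpy(p, "leak me");  /* ... used ...                        */
    return 0;                     /* ... but never freed: 100 bytes lost */
}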

… and many more

Static Analysis

[Diagram: a spectrum of techniques from Compiler to Program Verifier with increasing semi-automation: Static Semantic Analysis, Data Flow Analysis, Abstract Interpretation, Model Checking, and Formal Verification (Axiomatic, Denotational, and Operational Semantics), up to Logical Inference and Theorem Proving]

Better Programming with Tools?

A programmer once wrote

for( int i = 0; i < 5; i++ ) p++;

He probably meant to increase the pointer p by 5

Better Programming with Tools?

for( int i = 0; i < 5; i++ ) p++;

Fortunately, a compiler with good static analysis will optimize this to p += 5

Can it do the same for the following loop? If so, how?

for( int i = 0; i < n; i++ ) p++;

Better Programming with Tools?

Let’s rewrite

for( int i = 0; i < n; i++ ) p++;

into the de‐sugared form

int i = 0;
while( i < n )
{
  p = p + 1;
  i = i + 1;
}

Is this the same as p = p + n?
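Before the formal proof, a hedged sketch of the answer in C: the loop body runs max(n, 0) times, so the strength-reduced form needs a guard in general; the unguarded p = p + n is only equivalent when n is nonnegative, which is exactly what the precondition of the proof below assumes:

#include <stdio.h>

/* the de-sugared loop as a function */
int *bump_loop(int *p, int n)
{
    for (int i = 0; i < n; i++)
        p++;
    return p;
}

/* what a compiler's induction-variable elimination may produce:
   the body executes max(n, 0) times, hence the guard */
int *bump_opt(int *p, int n)
{
    if (n > 0)
        p += n;
    return p;
}

int main(void)
{
    int arr[8];
    /* both return arr + 5, so the difference is 0 */
    printf("%d\n", (int)(bump_loop(arr, 5) - bump_opt(arr, 5)));
    return 0;
}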

Formal Proof: Axiomatic Semantics

{ 0 ≤ n ∧ p = q }                 weakest precondition
{ 0 ≤ n ∧ p − q = 0 }
i = 0;                            apply assignment rule
{ i ≤ n ∧ p − q = i }             loop invariant
while( i < n )
{
  { i < n ∧ p − q = i }           i < n ∧ loop invariant
  { i+1 ≤ n ∧ (p+1) − q = i+1 }
  p = p + 1;                      apply assignment rule
  { i+1 ≤ n ∧ p − q = i+1 }
  i = i + 1;                      apply assignment rule
  { i ≤ n ∧ p − q = i }           loop invariant
}
{ i ≥ n ∧ i ≤ n ∧ p − q = i }     ¬(i < n) ∧ loop invariant
{ p = q + n }                     postcondition

Formal Proof: Axiomatic Semantics

{ Q[V\E] }   weakest precondition
V := E
{ Q }        postcondition

{ (!C∨P1) ∧ (C∨P2) }   weakest precondition
if (C) {
   { P1 }   precondition of S1
   S1
   { Q }    postcondition
} else {
   { P2 }   precondition of S2
   S2
   { Q }    postcondition
}
{ Q }       postcondition

Formal Proof: Axiomatic Semantics

{ P1 }   precondition
S1;
{ Q1 }   postcondition, s.t. Q1 implies P2
{ P2 }   precondition
S2;
{ Q2 }   postcondition

{ Inv }  weakest precondition (Inv = the loop invariant)
while (C) {
   { C∧Inv }   precondition of S
   S
   { Inv }     postcondition of S
}
{ !C∧Inv }     postcondition
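The same rules in conventional Hoare-logic notation (a standard presentation added here for reference, not copied from the slides):

\[
\{Q[V\backslash E]\}\; V := E\; \{Q\}
\qquad
\frac{\{P\}\;S_1\;\{R\}\quad \{R\}\;S_2\;\{Q\}}{\{P\}\;S_1;S_2\;\{Q\}}
\qquad
\frac{\{C\land Inv\}\;S\;\{Inv\}}{\{Inv\}\;\mathbf{while}\;(C)\;S\;\{\lnot C\land Inv\}}
\]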

The Good, the Bad, and the Ugly

for( int i = 0; i < 5; i++ ) p++;

is optimized by a good compiler to

p += 5;

When we prefer elegant code, we should not have to optimize it by hand into ugly fast code: the compiler does this for you in most (but not all) cases.

The elegant version:

int gcd( int a, int b )
{
  if (0 == b)
    return a;
  return gcd(b, a % b);
}

The hand-optimized version:

int gcd( int a, int b )
{
  while (b != 0)
  {
    register int t = b;
    b = a % b;
    a = t;
  }
  return a;
}

The Good, the Bad, and the Ugly

Many inefficiencies can be optimized away by compilers

But compilers optimize without regard to parallel execution!

x = 1;   // compiler removes this dead code
x = 0;

The Good, the Bad, and the Ugly

Many inefficiencies can be optimized away by compilers

But compilers optimize without regard to parallel execution!

Process 0:             Process 1:
x = 1;   // removed    if (x == 1)
x = 0;                     exit(0);
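One remedy (a sketch using C11 atomics; the talk itself does not prescribe a fix) is to declare the shared variable atomic, so the compiler may no longer treat the first store as dead:

#include <stdatomic.h>

atomic_int x;   /* shared between the two processes/threads */

void process0(void)
{
    /* with a plain int, x = 1 is removable dead code;
       as atomic stores, both writes are performed in order */
    atomic_store(&x, 1);
    atomic_store(&x, 0);
}

void process1(void)
{
    if (atomic_load(&x) == 1)
        ;   /* the exit(0) of the slide's example would go here */
}

int main(void) { process0(); process1(); return 0; }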

Syntactic Mistakes are Easy to Detect by Compilers/Static Checking Tools

int a[10000];

void f()
{
    int i;

    for( i = 0; i < 10000; i++ );   /* stray semicolon: the loop has an empty body */
        a[i] = i;                   /* runs once with i == 10000, off the end of a */
}

Syntactic Mistakes are Easy to Detect by Compilers/Static Checking Tools

if( x != 0 )
    if( p ) *p = *p/x;
else                    /* dangling else: binds to if( p ), not if( x != 0 ) */
    if ( p ) *p = 0;    /* so when x == 0 nothing happens at all              */

Compilers/Static Checking Tools Warn About Data Type Usage Mistakes

#include <stdio.h>
#include <string.h>

unsigned a[100] = {0};

int main()
{
    char buf[200];
    unsigned n = 0;

    while( fgets( buf, 200, stdin ) )
    {
        if( n < 100 ) a[n++] = strlen(buf);
    }
    while( --n >= 0 )   /* n is unsigned: --n >= 0 is always true */
    {
        printf( "%d\n", a[n] );
    }
    return 0;
}

Static Checking Tools Warn About Data Compatibility Mistakes

x = 4;
if( x >= 0 )
    x = (x > 0);   /* assigns a boolean 0/1 result to an int */
else
    x = -1;
x = x % 2;

Static Checking Tools Warn About Execution Order Mistakes

#include <iostream>
using std::cout;

void out( int n )
{
    cout << n << "\n";
}

void show( int a, int b, int c )
{
    out( a ); out( b ); out( c );
}

int main()
{
    int i = 1;
    show( i++, i++, i++ );   /* evaluation order of the three i++ is unspecified */
    return 0;
}
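A sketch of the conventional fix (a C rendering of the slide's C++ example): since the evaluation order of the three i++ arguments is unspecified, sequence the side effects yourself:

#include <stdio.h>

void show( int a, int b, int c ) { printf( "%d %d %d\n", a, b, c ); }

int main(void)
{
    int i = 1;
    int a = i++;      /* force one evaluation order ...   */
    int b = i++;
    int c = i++;
    show( a, b, c );  /* ... so this always prints 1 2 3  */
    return 0;
}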

Static Checking Tools Warn About Arithmetic/Logic Mistakes

#include <stdio.h>

void print_mod( int i, int n )
{
    if( n == 0 && i == 0 ) return;   /* bug: when n == 0 and i != 0, i % n divides by zero */
    printf( "%d mod %d == %d\n", i, n, i % n );
}

int main()
{
    for( int i = 0; i < 10; i++ )
        for( int j = 0; j < 10; j++ )
            print_mod( i, j );
    return 0;
}

Static Checking Tools Warn About Loop Index Mistakes

int a[10];

i = 0;
for( i = 0; i < 10; i++ )
    sum = sum + a[i];
weighted = a[i] * sum;   /* after the loop i == 10, so a[i] is out of bounds */

Static Checking Tools Warn About Data Flow Mistakes

#include <stdio.h>

int shamrock_count( int leaves, double leavesPerShamrock )
{
    double shamrocks = leaves;
    shamrocks /= leavesPerShamrock;
    return leaves;   /* bug: computes shamrocks but returns leaves unchanged */
}

int main()
{
    printf( "%d\n", shamrock_count( 314159, 3.14159 ) );
    return 0;
}

More Difficult: Incorrect API Logic

#include <stdio.h>

int getline( FILE *fd, char *buf );

int main()
{
    FILE *fd = fopen( "data", "r" );
    char buf[100] = "";
    if( getline( fd, buf ) )   /* getline returns 1 only when fd is NULL ...        */
        fclose( fd );          /* ... so fclose runs exactly when nothing was opened */
    printf( "%s\n", buf );     /* and an opened file is never closed (leak)          */
}

int getline( FILE *fd, char *buf )
{
    if( fd )
    {
        fgets( buf, 100, fd );
        return 0;
    }
    return 1;
}

More Difficult: Dynamic Typing/Data Flow Mistakes

class BankAccount

  def accountName
    @accountName = "John Smith"
  end

  def deposit
    @deposit
  end

  def deposit=(dollars)
    @deposit = dollars
  end

  def initialize ()
    @deposet = 100.00   # typo: @deposet, so @deposit is never initialized
  end

  def test_method
    puts "The class is working"
    puts accountName
  end
end

Abstract Interpretation

int[] a = new int[10];
i = 0;
while( i < 10 ) {
  … a[i] …
  i = i + 1;
}

What is the range of i at each program point? Is the range of i safe to index a[i]?

Abstract Interpretation

Define a lattice:
[a,b] ⊔ [a’,b’] = [min(a,a’), max(b,b’)]
[a,b] ⊓ [a’,b’] = [max(a,a’), min(b,b’)]

    int[] a = new int[10];
    i = 0;                i = [0,0]
p0: while( i < 10 ) {     i = [0,0] ⊓ [−∞,9] = [0,0]
                          i = ([0,0] ⊔ [1,1]) ⊓ [−∞,9] = [0,1]
                          … use acceleration to determine finite convergence …
p1:   … a[i] …            i = [0,9]
p2:   i = i + 1;          i = [1,1] ⊓ [−∞,9] = [1,1]
                          i = ([1,1] ⊔ [2,2]) ⊓ [−∞,9] = [1,2]
                          … use acceleration to determine finite convergence …
}
p3:                       i = [1,10]
                          i = [1,10] ⊓ [10,+∞] = [10,10]
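The analysis above can be mechanized in a few lines. A minimal C sketch of the interval domain follows, with join, meet, and a crude widening standing in for the slide's "acceleration"; with widening alone the loop-exit range is [10,+∞] rather than the exact [10,10], which a narrowing pass would recover:

#include <stdio.h>
#include <limits.h>

typedef struct { long lo, hi; } Interval;

static long lmin(long a, long b) { return a < b ? a : b; }
static long lmax(long a, long b) { return a > b ? a : b; }

/* [a,b] ⊔ [a',b'] and [a,b] ⊓ [a',b'] from the slide's lattice */
static Interval join(Interval a, Interval b)
{ Interval r = { lmin(a.lo, b.lo), lmax(a.hi, b.hi) }; return r; }
static Interval meet(Interval a, Interval b)
{ Interval r = { lmax(a.lo, b.lo), lmin(a.hi, b.hi) }; return r; }

/* widening: jump straight to ±infinity when a bound grows */
static Interval widen(Interval a, Interval b)
{
    Interval w = a;
    if (b.lo < a.lo) w.lo = LONG_MIN;
    if (b.hi > a.hi) w.hi = LONG_MAX;
    return w;
}

int main(void)
{
    Interval i = { 0, 0 };               /* i = 0           */
    Interval guard = { LONG_MIN, 9 };    /* i < 10          */
    for (;;) {
        Interval body = meet(i, guard);  /* p0: enter loop  */
        Interval next = { body.lo + 1, body.hi + 1 };  /* p2: i = i + 1 */
        Interval merged = join(i, next); /* merge back edge */
        if (merged.lo == i.lo && merged.hi == i.hi) break;  /* fixed point */
        i = widen(i, merged);            /* accelerate      */
    }
    Interval at_use = meet(i, guard);    /* p1: range of i at a[i] */
    printf("at a[i]: [%ld,%ld] -> safe for a[0..9]\n", at_use.lo, at_use.hi);
    return 0;
}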

Model Checking

Process 1:                     Process 2:
p0: while( x > 0 ) {           q0: x = 0;
p1:    use resource            q1: use resource forever
       x = x + 1;
    }
p2: sleep;

Model (using abstract traces where x is positive or 0):

[State graph over: p0,q0,x>0; p1,q0,x>0; p0,q1,x=0; p1,q1,x>0; p2,q1,x=0; p0,q1,x>0; p1,q1,x>0]

Q: starting with x > 1, will p1 ever concurrently execute with q1?
Q: will execution reach a state where x stays 0?
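Explicit-state model checking of this tiny system fits in a few lines. A sketch that enumerates the reachable abstract states (with x abstracted to 0 or positive, as in the slide) and answers both questions:

#include <stdio.h>

enum { P0, P1, P2 };   /* process 1 pc: p0, p1, p2 (sleep)  */
enum { Q0, Q1 };       /* process 2 pc: q0, q1 (forever)    */
enum { ZERO, POS };    /* abstraction of x                  */

#define ID(p, q, x) ((p) * 4 + (q) * 2 + (x))   /* 12 possible states */

static int seen[12];

static void explore(int p, int q, int x)
{
    if (seen[ID(p, q, x)]) return;
    seen[ID(p, q, x)] = 1;
    if (p == P0) explore(x == POS ? P1 : P2, q, x);  /* test x > 0 */
    if (p == P1) explore(P0, q, POS);                /* x = x + 1  */
    if (q == Q0) explore(p, Q1, ZERO);               /* x = 0      */
    /* p2 sleeps and q1 loops without changing the state */
}

int main(void)
{
    explore(P0, Q0, POS);   /* initial state: x > 0 */
    printf("p1 concurrent with q1: %s\n",
           (seen[ID(P1, Q1, ZERO)] || seen[ID(P1, Q1, POS)]) ? "yes" : "no");
    printf("x stuck at 0 reachable: %s\n",
           seen[ID(P2, Q1, ZERO)] ? "yes (p2,q1,x=0 is a sink)" : "no");
    return 0;
}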

Related Examples

C with assertion:

for (i = 1; i < N/2; i++) {
  k = 2*i - 1;
  assert(k >= 0 && k < N);
  a[k] = …
}

Programmer moved the assertion:

for (i = 1; i < N/2; i++) {
  assert(i > 0 && 2*i < N+1);   /* the old assertion with k = 2*i - 1 substituted */
  k = 2*i - 1;
  a[k] = …
}

Can we move the assertion before the loop? If so, how?
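A sketch of one answer: k = 2*i − 1 is monotone in i, so the per-iteration assertion is implied by two checks on the loop's first and last iterations, which can be hoisted before the loop (fill and the concrete N are hypothetical scaffolding):

#include <assert.h>

#define N 100
int a[N];

void fill(void)
{
    int i, k;
    if (1 < N/2) {                    /* the body runs at least once */
        assert(2*1 - 1 >= 0);         /* smallest k, at i = 1        */
        assert(2*(N/2 - 1) - 1 < N);  /* largest k, at i = N/2 - 1   */
    }
    for (i = 1; i < N/2; i++) {
        k = 2*i - 1;                  /* no assertion needed here    */
        a[k] = k;
    }
}

int main(void) { fill(); return 0; }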

Related Examples

Eiffel “design by contract”: methods must obey the class invariant

indexing
   ...
class COUNTER
feature
   ...
   decrement is
         -- Decrease counter by one.
      require
         item > 0
      do
         item := item - 1
      ensure
         item = old item - 1
      end
invariant
   item >= 0
end

[Slide: facsimile of a two-column paper page on CTADEL. Recoverable highlights: common-subexpression elimination over (self-)commuting operators (e.g., nested fft calls reusing one FFT step via an index substitution [k ← k−1]); algebraic simplification of the discretized PDE for p by the GPAS algebraic simplifier; interchanging the reduction and scan summations, an algebraically valid transformation that sharply reduces the data volume communicated by the parallel scan (Fig. 1); and three alternative generated Fortran programs P1, P2, P3 for an example problem from the HIRLAM weather forecast model (Fig. 2), where P3 is the program CTADEL generates and the fastest on all platforms investigated.

Table 2. Total elapsed time (sec):

Platform               P1      P2      P3
HP 712/60              2.30    0.502   0.361
SGI Indy               0.622   0.553   0.362
CRAY-C98, 1 CPU        0.011   0.012   0.006
CRAY-C98, 2 CPUs       0.013   0.014   0.008
MasPar-MP1, i-distrib. 0.523   0.340   0.179
MasPar-MP1, j-distrib. 237.    243.    164.
]

Generating Correct and Efficient Code from High‐Level Specifications

FLAME library code

Some Concluding Remarks

• Static checking requires all compiler stages except the back‐end
• Static checkers find deviations from “best practices”
• Static checkers may find many false alarms
• There are very few checkers for scripting languages (Perl, Python, Ruby, …)
• Functional languages (Haskell, ML, …) are generally safer, but do not prevent logical programming errors