Know Your Limits

Reviewed in this issue: Limits to Parallel Computation: P-Completeness Theory, by Raymond Greenlaw, H. James Hoover, and Walter L. Ruzzo (Oxford University Press, 1995, ISBN-10: 0195085914, ISBN-13: 978-0195085914).

Igor L. Markov, University of Michigan

IT IS UNUSUAL to review a book published 18 years ago. However, some books are ahead of their time, and some prospective readers may have gotten behind the curve. To this end, the development of commercial parallel software is clearly lagging behind initial hopes and promises, perhaps because known limits to parallel computation have been overlooked.

The history of humankind includes several striking technological scenarios that seemed feasible and admitted promising demonstrations, but could not be applied in practice. One example was perpetual motion, defined as "motion that continues indefinitely without any external source of energy". The hope was to build a machine doing useful work without being resupplied with fuel. Records of perpetual motion trials date back to the seventeenth century. It took two centuries to formulate the laws of thermodynamics and show why perpetual motion in an isolated system is not possible. A second example is the mythical philosopher's stone that transmuted base metals into gold through chemical processes (in fact, published accounts with experimental validation were as respected as modern-day research publications). However, by the late nineteenth century we understood that chemical reactions do not alter the chemical elements listed in periodic tables. Both stories show that fundamental limits were discovered, prohibiting the initial scenarios. However, this is not how these stories end. Perpetual motion can be successfully emulated by tapping an abundant energy source while the system remains isolated for practical purposes; e.g., GPS navigation satellites use solar energy to power their continual transmissions. Another example is nuclear propulsion in ballistic missile submarines that remain submerged and isolated for years. Even the transmutation of cheap metals into gold has been demonstrated in particle accelerators, and platinum-group metals can be commercially extracted from spent nuclear fuel. Once scientists develop an understanding of fundamental limits, engineers circumvent these limits by reformulating the challenge or by other clever workarounds.

Today, the business value in many industries is fueled by computation, just like it was driven by steam engines during the industrial revolution and backed up by precious metals during the tumultuous Middle Ages. The need for faster computation leads to significant investments into computing hardware and software. Just like Chemistry and Physics were developed to study chemical reactions and energy conversion, Computer Science was developed in the last 60 years to study algorithms and computation. In particular, Complexity theory studies the limits of computation, as illustrated by the notion of NP-complete problems (a standard textbook is Michael Sipser's "Introduction to the Theory of Computation"). Current consensus is that these problems cannot be solved in worst-case polynomial time without major theoretical breakthroughs, and the knowledge accumulated in the field allows one to quickly evaluate and diagnose purported breakthroughs. Even the least-informed funding agencies would now recognize naïve attempts at solving NP-complete problems in polynomial time. On the other hand, the understanding of such limits guided applied algorithm development to identify and exploit useful features of problem instances. An example end-to-end discussion can be found in the DAC 1999 paper "Why is ATPG Easy?" by Prasad, Chong, and Keutzer. Moreover, for optimization problems, the notion of NP-hardness can sometimes be circumvented by approximating optimal solutions (typical for geometric tasks, such as the Travelling Salesman Problem). As a result, the software and hardware industries have been quite successful in circumventing computational complexity limits in applications ranging from formal verification to large-scale interconnect routing. And chess-playing computers go far beyond NP.
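
To make the point about approximation concrete, here is a minimal Python sketch of my own (not drawn from the works cited above) of the classic way NP-hardness is sidestepped for metric instances of the Travelling Salesman Problem: tour a minimum spanning tree in preorder. The tour is at most twice the optimal length, because an optimal tour costs at least as much as the MST, and shortcutting repeated vertices never increases length under the triangle inequality. The four-city instance at the end is made up purely for illustration.

    import math

    def mst_preorder_tour(points):
        """Metric-TSP 2-approximation: build a minimum spanning tree with Prim's
        algorithm, then visit vertices in preorder of that tree (valid under the
        triangle inequality)."""
        n = len(points)
        dist = lambda i, j: math.dist(points[i], points[j])
        in_tree = {0}
        best = {v: (dist(0, v), 0) for v in range(1, n)}   # cheapest attachment
        children = {v: [] for v in range(n)}
        while len(in_tree) < n:
            v = min(best, key=lambda u: best[u][0])
            _, p = best.pop(v)
            in_tree.add(v)
            children[p].append(v)
            for u in best:
                if dist(v, u) < best[u][0]:
                    best[u] = (dist(v, u), v)
        tour, stack = [], [0]
        while stack:                                       # preorder walk of the MST
            v = stack.pop()
            tour.append(v)
            stack.extend(reversed(children[v]))
        return tour + [0]                                  # return to the start

    # Hypothetical four-city instance.
    print(mst_preorder_tour([(0, 0), (0, 2), (3, 0), (3, 2)]))   # [0, 1, 2, 3, 0]
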

History does not repeat itself, but it often rhymes, as Mark Twain noted. The latest craze in software, parallel computing, has given us hope to turn silicon (predesigned processor cores) into computation without increasing clock speed and power dissipation per core. As top-of-the-line integrated circuits cost more than their weight in gold, the philosopher's stone pales in comparison to the value proposition of turning not base metals, but sand into something more expensive than gold. And we now see academics, instigated by U.S. funding agencies left unnamed (to protect the guilty!), claim fantastic parallel speed-ups that do not survive scrutiny. Those who attended the panel on parallel Electronic Design Automation at ICCAD 2011 may recall that I questioned claims of algorithmic "superlinear" speed-up (more than k times when using k processors, for large k). If using k parallel threads of execution consistently improves single-thread runtime by more than a factor of k, then we could just simulate the k threads by time-slicing a single thread, with a factor-of-k slowdown. This yields a better sequential algorithm. Thus, the original comparison was to suboptimal sequential algorithms (using k CPU caches can boost memory performance, but only by a constant factor, and not entirely due to parallel algorithms). Other signs that a claimed speed-up is bogus can be subtle and ad hoc. For fundamental problems, like Boolean SAT and circuit simulation, that have consistently defied parallelization efforts by sophisticated researchers, a spectacular speed-up (e.g., 220 times claimed at ICCAD 2011 for SAT) had better have a convincing and unexpected explanation. Patrick Madden's ASPDAC 2011 paper illustrates how academics often oversimplify the challenge they are studying and ignore the best known techniques in their empirical comparisons. David Bailey's SC 1992 paper "Misleading Performance Claims in the Supercomputing Field" and its DAC 2009 reprise suggest that this phenomenon is not new.
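
The time-slicing argument can be written down as a few lines of accounting (a sketch of my own, with made-up numbers purely for illustration): one thread can emulate k threads with at most a factor-of-k slowdown, so any speed-up beyond k implies that the sequential baseline was not the best available sequential algorithm.

    def check_speedup_claim(sequential_baseline_s, parallel_time_s, k_threads):
        """Sanity-check a reported parallel speed-up using the time-slicing bound:
        one thread can emulate k threads with at most a factor-of-k slowdown
        (ignoring constant-factor effects such as the larger aggregate cache)."""
        claimed_speedup = sequential_baseline_s / parallel_time_s
        emulated_sequential_s = k_threads * parallel_time_s
        print(f"claimed speed-up on {k_threads} threads: {claimed_speedup:.1f}x")
        if claimed_speedup > k_threads:
            # Time-slicing the parallel run beats the baseline, so the baseline
            # was a suboptimal sequential algorithm.
            print(f"superlinear: emulating the run on one thread takes about "
                  f"{emulated_sequential_s:.0f}s, beating the "
                  f"{sequential_baseline_s:.0f}s baseline")

    # Hypothetical numbers: a 1000-second baseline, 16 threads, a reported 25-second run.
    check_speedup_claim(sequential_baseline_s=1000.0, parallel_time_s=25.0, k_threads=16)
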
The article "Parallel Logic Simulation: Myth or Reality?" in the April 2012 issue of IEEE Computer offers a great exposition of the promise and the failure of parallel functional logic simulation (e.g., evaluating new circuit designs before silicon production). Many people find it obvious that Boolean circuit simulation should be easy to parallelize, and academic papers claim such results. But implementing this idea in successful commercial software has been a losing proposition for many years (leaving the market open to expensive hardware emulators developed by IBM, EVE, Cadence, Synplicity/Synopsys, and others). The authors of the IEEE Computer article dissect many failed attempts and the obstacles encountered. This is where careful observers may suspect fundamental limits.

Enter the book Limits to Parallel Computation: P-Completeness Theory by Greenlaw, Hoover, and Ruzzo. Just like NP-complete problems defy worst-case polynomial-time algorithms, P-complete problems defy significant speed-ups through parallel computation. The Preface says:

    This book is an introduction to the rapidly growing theory of P-completeness, the branch of complexity theory that focuses on identifying the "hardest" problems in the class P of problems solvable in polynomial time. P-complete problems are of interest because they all appear to lack highly parallel solutions. That is, algorithm designers have failed to find NC algorithms, feasible highly parallel solutions that take time polynomial in the logarithm of the problem size while using only a polynomial number of processors, for them. Consequently, the promise of parallel computation, namely that applying more processors to a problem can greatly speed its solution, appears to be broken by the entire class of P-complete problems.

Just like the well-known book "Computers and Intractability: A Guide to the Theory of NP-Completeness" by Garey and Johnson, this book consists of two parts: an introduction to P-completeness theory, and a catalog of P-complete problems. It starts with an anecdote about a company that was forced by its competitors to look into parallel platforms and thus developed parallel sorting of n elements using n^2 processors in O(log n) time. This example is used to motivate key concepts, such as reductions and implied limits to parallel computation (as in my earlier argument about the maximal speed-up due to k processors). Similar to the theory of NP-completeness, this leads to the notion of P-complete problems to which many other problems can be reduced. Thus, when looking for effective parallel solutions to a particular problem, one must first check for reductions to known P-complete problems. For example, Linear Programming and Maximum Flow (Problems A.4.3 and A.4.4 in the catalog) are P-complete, while Maximum Matching (Problem B.9.7) admits a highly parallel probabilistic algorithm. This catalog is substantial. It contains multiple variants of problems, including restricted versions. For example, Linear Programming with Two Variables per Constraint (Problem B.2.2) and 0-1 Maximum Flow (B.9.6) admit highly parallel algorithms. Another quote:

    Additionally, P-completeness theory can guide algorithm designers in cases where a particular function has a highly parallel solution, but certain algorithmic approaches to its computation are not amenable to such solutions. Computing breadth-first level numbers via queue- versus stack-based algorithms is an example (see Chapter 8 for more details).
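
To make the example in this quote concrete, here is a minimal Python sketch of my own (not code from the book): level numbers computed frontier by frontier. Within a frontier the per-vertex work is independent and could be spread across processors, whereas a stack-based (depth-first) traversal fixes one long sequential visiting order.

    def bfs_levels(adj, source):
        """Breadth-first level numbers via synchronized frontiers."""
        levels = {source: 0}
        frontier = [source]
        depth = 0
        while frontier:
            depth += 1
            next_frontier = []
            for u in frontier:              # independent per-vertex expansions
                for v in adj.get(u, ()):
                    if v not in levels:     # a parallel version would need an
                        levels[v] = depth   # atomic "visited" test-and-set here
                        next_frontier.append(v)
            frontier = next_frontier
        return levels

    # Tiny example graph given as adjacency lists.
    adj = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
    print(bfs_levels(adj, "a"))             # {'a': 0, 'b': 1, 'c': 1, 'd': 2}
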
The two main models of parallel computation are combinational Boolean circuits and shared-memory multiprocessors. Here, a key issue is the efficiency of parallel simulation of a Boolean circuit on a multiprocessor and simulating a multiprocessor by unrolled combinational circuits. Further analysis is based on formal notions of a computational problem, reducibility, and completeness. These notions lead to complexity classes, such as P (problems solvable in polynomial time) and NC (problems solvable by poly-sized circuits of polylogarithmic depth/delay, named "Nick's class" after Nicholas Pippenger). Clearly, NC is contained in P, but is believed to be smaller than P (just like P is believed to be smaller than NP). Because any problem in P can be efficiently reduced (NC-reduced) to any P-complete problem, finding a P-complete problem inside NC would contradict P ≠ NC (Theorem 3.5.4). So, if you are comfortable interpreting NP-complete as "likely not solvable in polynomial time," you should be comfortable interpreting P-complete as "likely not executable efficiently in parallel." The prototypical P-complete problems are circuit and program simulation. P-complete problems are "inherently sequential" in the sense that P = NC is unlikely. The reasons have to do with the efficiency of highly parallel simulation and can be summarized as follows: (i) generic simulation is slow, regardless of the algorithm used, (ii) fast special-case simulation techniques are not general enough, and (iii) straightforward simulation techniques are provably slow. An additional rule of thumb is that efficiently parallelizable problems can usually be solved in polylog space (in addition to having access to the input). Details can be found in David Johnson's April 1983 Journal of Algorithms column.

Focusing entirely on problems solvable in polynomial time would have excluded tasks in formal verification and logic synthesis, which sometimes venture far beyond NP. There is no hope that using polynomially many processors can make even NP-complete problems poly-time solvable, which is consistent with empirical results on Boolean Satisfiability seen today. However, if we are interested in polynomial-time heuristics for problems beyond P (which is how most practical work is done), we might first try to parallelize tried-and-true sequential heuristics. To this end, Greenlaw, Hoover, and Ruzzo show that sequential greedy algorithms frequently lead to solutions that are inherently sequential, i.e., cannot be duplicated rapidly in parallel, unless NC = P. But sometimes equally good solutions can be produced by parallel algorithms. To this end, P-complete algorithms are discussed through the proxy of tasks to reproduce the output of a particular algorithm. For example, conventional algorithms for Gaussian elimination (with partial pivoting) appear inherently sequential, but other, highly parallel algorithms exist for the same problem. This particular example can be useful in SPICE-like accurate electrical circuit simulation. More generally, such examples motivate pessimism about automatic parallelization by compilers, given that compilers generally do not invent entirely new algorithms.
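
A classic illustration of the point about greedy algorithms, sketched below in Python (my own toy code, not the book's): computing the lexicographically first maximal independent set, the output of the obvious greedy scan, is a known P-complete task, whereas a randomized, round-based algorithm in the spirit of Luby's produces some, equally acceptable, maximal independent set through rounds of per-vertex decisions that depend only on neighbors and could therefore run in parallel.

    import random

    def greedy_mis(adj):
        """Lexicographically first maximal independent set: scan vertices in a
        fixed order, taking a vertex unless a neighbor was already taken."""
        taken = set()
        for v in sorted(adj):
            if not any(u in taken for u in adj[v]):
                taken.add(v)
        return taken

    def luby_style_mis(adj, seed=0):
        """Round-based MIS in the spirit of Luby's algorithm (simulated
        sequentially): every remaining vertex draws a random priority, local
        maxima join the set, and winners plus their neighbors drop out."""
        rng = random.Random(seed)
        remaining, mis = set(adj), set()
        while remaining:
            priority = {v: rng.random() for v in remaining}
            winners = {v for v in remaining
                       if all(priority[v] > priority[u]
                              for u in adj[v] if u in remaining)}
            mis |= winners
            remaining -= winners | {u for v in winners for u in adj[v]}
        return mis

    # A 4-cycle a-b-c-d-a: both routines return a maximal independent set.
    adj = {"a": ["b", "d"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c", "a"]}
    print(sorted(greedy_mis(adj)), sorted(luby_style_mis(adj)))
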

Section 9.2 shows some loopholes in dealing with P-complete problems and briefly discusses polynomial speed-up in parallelizing special cases of the Circuit Value Problem, Depth-first Search, etc. It also shows how to upper-bound such speed-ups. Another loophole is analogous to that in NP-completeness theory and relies on quick (parallel) approximations that bypass exact algorithms (Chapter 10). Unfortunately, this helps only in rare cases (Section 10.2). Such logic seems more promising in parallelizing approximate sequential solutions to NP-complete problems (Section 10.3), such as bin packing, 0-1 knapsack, and scheduling: problems inherent in load-balancing on parallel platforms. David Johnson's April 1983 J. Algorithms column reviews such results.

The appendices list problems whose status (P-complete or not) is known, as well as open problems. Each problem is defined in a self-contained way, and relevant problem reductions are outlined. Circuit-related P-complete tasks include Problem A.1.9, the Min-Plus Circuit Value Problem, which can be viewed as a narrow form of Static Timing Analysis with rational values. Problem A.10.1 is a sweeping generalization with real-valued numbers. Other problems of relevance to Computer Engineering include Graph Partitioning, List Scheduling, Linear Programming, Network Flow problems, certain approximations to Max-SAT and Min Set-Cover, and even Two-Layer Channel Routing. Section A.7 lists problems dealing with formal languages (pushdown automata, context-free grammars, etc.), and captures various tasks performed by parsers, a common bottleneck in parallel software. Later sections include Gaussian elimination, various geometry problems (triangulation, convex hulls, etc.), several numerical analysis problems, as well as Lempel-Ziv (LZ) compression. Fortunately, LZ is not an obstacle to parallel I/O because it is applied to small blocks, not to entire files.
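
To see why Problem A.1.9 mentioned above reads like timing analysis, here is a minimal sketch of my own (with a made-up netlist): a min-plus circuit is evaluated gate by gate in topological order, where "plus" accumulates delays along a path and "min" keeps the earliest arrival time where paths merge; this is essentially the inner loop of a simple arrival-time computation.

    def eval_min_plus_circuit(inputs, gates):
        """Evaluate a min-plus circuit: gates are (name, op, operands) listed in
        topological order; 'plus' adds values, 'min' keeps the smallest."""
        value = dict(inputs)
        for name, op, operands in gates:        # one gate at a time, in order
            vals = [value[x] for x in operands]
            value[name] = sum(vals) if op == "plus" else min(vals)
        return value

    # Made-up netlist: two paths from the primary inputs to 'out'.
    inputs = {"a": 0.0, "b": 0.5, "d1": 1.25, "d2": 0.25}   # arrivals and delays
    gates = [
        ("n1", "plus", ["a", "d1"]),            # arrival of a after a 1.25 delay
        ("n2", "plus", ["b", "d2"]),            # arrival of b after a 0.25 delay
        ("out", "min", ["n1", "n2"]),           # earliest arrival of the two paths
    ]
    print(eval_min_plus_circuit(inputs, gates)["out"])      # 0.75
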
Computer engineers have been ignoring fundamental limits to parallel computation for years. For example, the 2006 manifesto "The Landscape of Parallel Computing Research: A View from Berkeley" does not mention complexity limits to parallel algorithms or the concept of P-completeness. The Berkeley engineering professors who authored the manifesto represented key parallel applications by "13 dwarfs", patterns of computation and communication (extending the seven dwarfs defined by Phil Colella). But we are not told that some of these dwarfs are in NC (easy to parallelize), some harbor P-complete problems (combinational logic, certain graph traversals), some are beyond P (branch-and-bound), and some are too broad for generic analysis (dynamic programming). Clearly, this classification is missing an important dimension. Not appreciating computational complexity, computer engineers have been cranking out papers on parallel algorithms for P-complete problems without realizing this (students can find those papers and match them to problems in Appendix A). David Bailey's 1991 note "Twelve Ways to Fool the Masses When Giving Performance Results on Parallel Computers" illustrates what many of these papers do. On the other hand, efforts at parallelizing hard problems can be useful, just like ongoing efforts on practical sequential algorithms for NP-hard and NP-complete problems through approximation and exploiting instance structure. In any case, researchers must clearly understand the fundamental limits they are up against, and a summary of known results in parallel algorithms clarifies what is achievable. For example, most highly parallelizable problems can be solved in polylogarithmic parallel time with a (close-to-) linear number of processors (in terms of input size), but sorting and biconnected components need n^2 processors.

Given that the book under review was published 18 years ago, one may wonder if its conclusions remain valid today. To this end, proven theorems are in no danger of becoming outdated, and the "P versus NC" challenge remains unresolved, just like its close relative "P versus NP". However, a few years after the book was published, Ketan Mulmuley proved that the P-complete max-flow problem cannot be solved in polylogarithmic time using polynomially many processors in the PRAM model under certain assumptions. In case of future breakthroughs on this topic, updates should promptly appear on the Wikipedia pages for the NC and NP classes. A modern formal treatment of multicore computing is available in Leslie Valiant's ESA 2008 paper "A Bridging Model for Multi-Core Computing," which discusses algorithms that are optimal for all combinations of machine parameters, including the number of cores and the shape of the memory hierarchy. Other practical aspects of parallel algorithms are explored in the 1997 volume "Parallel Algorithms: Third DIMACS Implementation Challenge." For example, a chapter by Papaefthymiou and Rodrigue points out that the Bellman-Ford algorithm runs faster in parallel on dense graphs, but not on sparse graphs.
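
The Bellman-Ford remark is easy to see in code (a sketch of my own, not the chapter's implementation): each round relaxes every edge against the previous round's distances, so the per-edge work within a round is independent and can be distributed; dense graphs simply supply far more per-round work over which to amortize synchronization and communication.

    def bellman_ford(num_vertices, edges, source):
        """Round-synchronous Bellman-Ford (no negative cycles assumed): up to
        n-1 rounds, each relaxing all edges against the previous distances.
        A parallel version distributes the per-edge relaxations of a round and
        combines concurrent updates to the same vertex with a min-reduction."""
        INF = float("inf")
        dist = [INF] * num_vertices
        dist[source] = 0.0
        for _ in range(num_vertices - 1):
            new_dist = dist[:]                     # read old, write new
            for u, v, w in edges:                  # independent per-edge work
                if dist[u] + w < new_dist[v]:
                    new_dist[v] = dist[u] + w
            if new_dist == dist:                   # converged early
                break
            dist = new_dist
        return dist

    edges = [(0, 1, 4.0), (0, 2, 1.0), (2, 1, 2.0), (1, 3, 1.0)]
    print(bellman_ford(4, edges, 0))               # [0.0, 3.0, 1.0, 4.0]
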

A major technological change in parallel computing is the increasing dominance of communication over computation. It is not explicitly addressed by the theory of P-completeness, but computation costs remain valid lower bounds and determine how much communication is needed. Thus, classical impossibility results and lower bounds on computation can still be trusted, but may be optimistic in practice. To this end, the 1988 IEEE Transactions on Computers paper "Your Favorite Parallel Algorithms Might Not Be as Fast as You Think" by David Fisher accounts for the finite density of processing elements in space, the (low) dimension d of the space in which parallel computation is performed, the finite speed of communication, and the linear growth of communication delay with distance. Neglected in most publications, these four factors limit parallel speed-up to power (d + 1). Considering matrix multiplication as an example where exponential speed-up is possible in theory, a two-dimensional computing system (a planar circuit, a modern GPU, etc.) can offer at most a cubic speed-up. Given that the general result is asymptotic, it is significant only for large numbers of processing elements that communicate with each other. In particular, for circuits and FPGAs, it limits the benefits of three-dimensional integration to power 4/3 (optimistically assuming a fully isotropic system). For two-dimensional GPUs, at most a cubic speed-up over sequential computation is possible. To this end, a 2012 report by the Oak Ridge Leadership Computing Facility analyzed widely used simulation applications (turbulent combustion; molecular, fluid, and plasma dynamics; seismology; atmospheric science; nuclear reactors, etc.). GPU-based speed-ups ranged from 1.4 to 3.3 times for ten applications and 6.1 times for the eleventh (quantum chromodynamics). These mediocre speed-ups likely reflect flaws in prevailing computer organization, where heavy reliance on shared memories dramatically increases communication costs, but alternatives would drastically complicate programming.
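
A back-of-the-envelope version of this argument (my own paraphrase, loose and only up to constant factors, with T_1 the sequential time and T_p the parallel time) shows where the exponent d + 1 comes from: a result produced after parallel time T_p can depend only on processing elements within distance O(T_p), of which a d-dimensional layout of finite density contains O(T_p^d), each contributing at most O(T_p) steps.

    \[
      T_1 \;\le\; \underbrace{O\!\left(T_p^{\,d}\right)}_{\text{reachable elements}}
                  \cdot \underbrace{O\!\left(T_p\right)}_{\text{steps each}}
           \;=\; O\!\left(T_p^{\,d+1}\right),
      \qquad
      \text{speed-up } \frac{T_1}{T_p} \;\le\; O\!\left(T_p^{\,d}\right)
           \;=\; O\!\left(T_1^{\,d/(d+1)}\right).
    \]

For d = 2 this gives T_1 = O(T_p^3), the cubic limit quoted above; going from d = 2 to d = 3 raises the exponent from 3 to 4, i.e., by the stated factor of 4/3.
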
AS IN HISTORICAL examples at the beginning of the review, the last word on parallel algorithms seems to be with loopholes. Even the core concepts we've discussed exhibit subtle flaws. For example, binary search is obviously in NC, but cannot be parallelized efficiently. The 1988 result of David Fisher questions the very relevance of the NC class in the physical world, as no exponential worst-case parallel speed-up can be achieved in three (or any finite number of) dimensions, even if all interconnects can be routed with the smallest possible lengths. Effective loopholes here hide communication latencies by connecting slow processors with fast interconnect, exploiting better-than-worst-case data patterns (through pipelining and trading communication for computation), and scaling semiconductor interconnect by using repeaters and electric tuning. The most popular loophole is to use an identical interconnect network for all input sets (up to 4 GB or, perhaps, 256 GB) and pretend that interconnect latencies remain constant as problem size grows. But even zero-latency communication would not help with obstacles related to P-completeness. In particular, the P-completeness of circuit and processor simulation problems explains the difficulties encountered by computer engineers when simulating new hardware designs on parallel systems (here an important loophole is hardware emulation). Thus, by warning about important pitfalls, a keen understanding of the obstacles to parallelism can guide toward more effective solutions, clever ways to reformulate the problem, and applications where speed-up is easier to achieve (data-distributed tasks such as digital cinematography, computational astronomy, and Web search). In summary, I am convinced that the book under review can intellectually enrich Computer Engineering research and enhance the level of discourse in the scholarly literature.

Acknowledgment

Several colleagues helped improve this book review. Dr. Michael Moffitt and Dr. Razit Topaloğlu from IBM, and Prof. Scott Aaronson from MIT pointed out several important omissions in early drafts. Prof. Massimo Cafaro from Università del Salento provided technical clarifications on complexity classes (via MathOverflow). Dr. David Johnson of AT&T was helpful in discussing the relative paucity of new developments in the complexity of parallel computation since the book was published. Dr. Grant Martin from Tensilica and Dr. Mehdi Saeedi from the University of Southern California helped improve readability.

Direct questions and comments about this article to Igor L. Markov, University of Michigan, 2260 Hayward St., Ann Arbor, MI 48109-2121 USA; [email protected].
