
arXiv:1603.08059v2 [cs.PL] 12 Jan 2018

CVC Verilog Compiler – Fast Complex Language Compilers Can Be Simple

Steven Meyer
Tachyon Design Automation
Boston, MA 02109, USA
[email protected]

Abstract

This paper explains how to develop an optimized flow graph compiled simulator for the Verilog hardware description language (HDL). It is claimed that the methods described here can be applied to the development of flow graph based compilers for other computer languages. The method uses the von Neumann computer architecture (the MRAM model) as the best abstract model of computation and uses comparison and selection of sequences of alternative machine code to utilize modern processor low level parallelism. By using the anti formalist method described here, the full IEEE 1364 2005 standard Verilog compiler required only 95,000 lines of C code and two developers, yet is the fastest available full accuracy Verilog HDL simulator. The paper explains how such a compiled simulator validates the anti-formalism computer science methodology expressed by Peter Naur's datalogy and provides a guide for applying the method. Development history from a slow compiled machine code simulator into a fast flow graph based compiler is described, including the failure of initial efforts that tried to automatically convert a full 1364 compliant interpreter into interpreted execution of a generated virtual machine. The argument that fast Verilog simulation requires removing abstraction detail is shown to be incorrect, and reasons parallel GPU Verilog simulation has not succeeded are given.

1. Introduction

This paper presents a novel method for implementing fast compilers for complex computer languages. The simple organization methods were used to create a fast machine code simulator for the Verilog hardware description language (HDL). In electronic design, electronic systems are expressed as hardware description language code. The HDL describes electronic systems and has both declarative gate level parts and procedural parts. In the case of Verilog, the HDL is a low level language similar to Pascal (Wirth 1975) with added low level parallelism. The Verilog compiler translates the HDL gate and electronic descriptions into X86 and X86_64 machine code for machines running Linux that is executed to simulate electronic system behavior after fabrication.

2. Simulator Description

The Verilog hardware description language is defined by the IEEE 1364-2005 standard (IEEE 2005). The standard includes both register transfer level (RTL) descriptions with procedural rules and gate level descriptions annotated with accurate delay information from manufacturing process measurements.

The simulator described here was developed by two people in less than 10 years. The ten years included training a programmer who started as a college student intern. A majority of the implementation time went into the elaboration feature added by the young programmer to generate instance code compatible with the 1364 standard committee's semantics, using the display based, almost machine register, technology required for very fast Verilog simulation.

The simulator consists of about 330,000 lines of C code, which is 10% or less of the size of the competing full accuracy Verilog simulators. The interpreted simulator, a commercial quality Verilog simulator, was developed first; the flow graph based compiler was then added. The Verilog compilation preparation phases (parser, elaborator and fix up phases) are about 100,000 lines (numbers are approximate because there are common routines and overlap). Of the 100,000 lines, 20,000 or 20% are needed for the complicated Verilog 2005 generate feature.

Computers are now so fast and are equipped with so much fast random access (virtual) memory that the various passes can be executed without storing the results of each pass in secondary storage. (Electronic design automation (EDA) tools, including the physical design tools not discussed here, run only on enterprise Linux because of the size of large quality EDA system designs.) The organization of the compiled simulator, in which one unified representation is modified and gradually transformed by each pass, can be viewed as a modern analog of the multi-pass (8-10 pass) compilers used early in compiler history; see (Gries 1971) pages 410, 411 and 451-454 for descriptions. Early compilers were developed by scientists and physicists who were mostly trained in applying specific methods to problems; see Budiansky's account of the early role of Patrick Blackett's English physicist department during WWII that was perhaps the first use of modern algorithmic thinking (Budiansky 2013). The Gier Algol compiler, for example, was a 6 pass compiler (Gries 1971) (p. 9) run on a machine with only 128,000 words of memory, in which identifiers were converted to Polish notation with updated type information for the next pass. The code generator described here does similar source program checking, but because of the internal representation of the simulator (and ample memory) the source is transformed into a flow graph virtual machine representation of basic blocks that contain virtual instructions which convert to machine instructions.
Generate allows compile time variables called parameters to change HDL variable sizes and design instance hierarchy structure. This feature creates instance specific constants that must be treated as variables during simulation. The other Verilog simulators flatten designs, resulting in much larger memory use for designs with many repeated instances, and flattening makes base addressing difficult. It is common for an IC to contain millions of instances of a latch or flip flop macro cell, each of which requires storing not just its per instance state information but also the machine instructions for every instance when the flattened elaboration method is used. In the simulator described here, only the machine code to simulate one instance plus per instance state information and per instance base addressing change machine instructions need to be stored. The development of the Verilog 1364 standard by committee has continually made Verilog more complicated (the language reference manual is 590 pages) and has made it more difficult to implement the simulator's per instance base address register algorithms. The display offset algorithm results in a simpler code generator and faster simulation at the cost of a much more complicated elaboration phase.

About 50,000 lines are needed to execute interpreted simulation. The compiler itself is only about 95,000 lines, of which 15,000 are executable binary support libraries. 70,000 lines are needed to implement miscellaneous features used both by the compiler and the interpreter: SDF delay annotation, four electronic design interface APIs (tf_, acc_, vpi_ and dpi_), a debugger for the interpreter, toggle coverage recording and report generation, rarely used switch level simulation, plus an expression evaluation variant called X propagation that implements a more pessimistic unknown (X state) evaluation algorithm.

3. Relation to Naur's Datalogy and Anti-Formalism

The theoretical background behind this simulator's development method follows Naur's methods. In the 1990s Peter Naur, one of the founders of computer science, realized that CS had become formal mathematics separated from reality. Naur advocated the importance of programmer specific development that does not use preconceptions. The clearest explanation of Naur's method as used in developing this compiler appears in the book Conversations - Pluralism in Software Engineering (Naur 2011). This book amplifies the program development method Naur described in his 2005 lecture (Naur 2007). In (Naur 2011) page 30, the interviewer asks "... you basically say that there are no foundations, there is no such thing as computer science, and we must not formalize for the sake of formalization alone". Naur answers, "I am not sure I see it this way. I see these techniques as tools which are applicable in some cases, but which definitely are not basic in any sense." Naur continues (p. 44) "The programmer has to realize what these alternatives are and then choose the one that suits his understanding best. This has nothing to do with formal proofs." Einstein described this as the 20th century split between axiomatics and reality by saying that "axiomatics purges mathematics of all extraneous elements", which makes it evident "that mathematics as such can not predict anything" about reality (Einstein 1921). See also (Naur 2005) and (Meyer 2013) for more detailed discussion of Naur's anti-formalism.

Current compiler development methodology chooses formal algorithms over Naur's programmer specific approach that rejects pre-suppositions. During the development of this compiler, anomalies in the mathematical foundations of logic that current computer science takes as truth beyond criticism influenced design decisions. The first is Paul Finsler's proof that the continuum hypothesis is true (Finsler 1969); the proof is only indirectly related to computer science. The second example is Juri Hartmanis' proof that P=NP in MRAM (parallel ram) models (Hartmanis and Simon 1976). This power of MRAM machines contributed to looking for other sources of parallel speed improvement and motivated using von Neumann machine architecture targeted language specific recursive descent parsing.

A more direct result of rejecting formalism is the choice to use the fast Cooper-Kennedy dominator algorithm (Cooper et al. 2006). This choice, and its use in the crucial define-use list optimization algorithm, is easy once one starts with skepticism toward algorithms that have been formally proven to be correct. The Cooper algorithm's "real" speed is viewed as falsification of the axioms used in concrete complexity theory.

The main lesson of this anti-formal development method is to avoid abstraction. Avoid machine generated tables and compiler phases generated by automatic generators. Parse using language specific recursive descent and use the C run time stack for remembering context. Parse expressions using simple operator precedence; this results in a very fast parser, but is only possible because assignment operators are not part of expressions in Verilog. Assume proofs use axioms that do not apply to reality unless it is specifically determined that the axioms are good. Finally, following Naur, look for algorithms that are simple so they can be adapted to the problem specific aspects of Verilog and the data structures and algorithms of simulation.

4. Importance of von Neumann Observation on Weakness of Formal Machines

There is a history of compiler development that was closely tied to the von Neumann computer architecture. The best model of computing is the von Neumann machine itself. Von Neumann was skeptical of models that used automata in general and understood that Turing Machines (TMs) were too weak to model human program writing, expressed in his 1950s criticism of neural networks and other simple automata. Von Neumann's thinking is analyzed in detail in (Aspray 1990) and (Meyer 2016). Von Neumann explicitly justified machine models that consist of a finite number of randomly accessible memory cells of unbounded cell size.

TMs are too weak a model of computation because they lack random access and lack the ability to select bits from cells. In other words, the P=?NP question only exists for TMs, not for von Neumann architecture MRAM machines, for which the class P is equal to the class NP (Hartmanis and Simon 1976) (Meyer 2016). For von Neumann machines, guessing (non determinism) is no faster than enumeration in the polynomial bound sense. This is expressed in CVC by the use of optimization algorithms that search not exhaustively but using graph theory properties combined with human conceptual problem solving. For decidable problems outside the class NP, such as the yes or no question "are two regular expressions equivalent?", compiled computer languages allow people to code their understanding into a computer program or electronic circuit description so that the program can solve concrete regular expression equivalence problems.

Two concrete examples of traditional compiler development as anti-formalism are William McKeeman et al.'s development of the XPL compiler system (McKeeman et al. 1971) and the development of the Bell Labs C computer language and compiler (also the development of Unix) (Ritchie 1993). McKeeman's contribution was understanding compiler boot strapping. The C and Unix developers understood that the model they were developing a compiler and operating system for was the von Neumann computer itself, instead of developing for some abstract model of computation as was, for example, attempted in the Multics project. These pioneers of compiler development also understood the importance of methods that allowed one or two people to develop large computer programs.

5. Development History

The simulator described here was developed in the 1990s as a Verilog simulator to compete with the original Verilog XL simulator that used interpreted execution. At the time, Verilog semantics was aimed at interpreted simulation because simulation properties such as delays could be set at any time during a simulation run and because a command line debugger was an integral part of Verilog, i.e. simulations often expected to read Verilog source from script files at various times during a simulation run. In the late 1990s Verilog native machine code compilers were introduced and Verilog semantics was changed to require specification of all design information and properties at compile time. See (Thomas and Moorby 2002) for a historical description of the Verilog HDL.

By 2000 this simulator was no longer speed competitive, so it was used as the digital engine for an analog and mixed signal (AMS) simulator. The speed problem was not as serious for mixed signal simulation because analog simulation requires solving differential equations; analog simulation numerical algorithms run orders of magnitude slower than digital simulation.

After the use in an AMS simulator, the simulator described here needed to be improved so that it would be speed competitive with compiled digital Verilog simulators. The most obvious speed problem involved register transfer level (RTL) simulation. RTL simulation is almost the same as normal programming language execution, except that Verilog values require at least 2 computer bits per Verilog bit and RTL execution must interact with an event driven scheduler.

The most obvious problem was that a number of if statements in the interpreter C code were needed to select which interpreter algorithm to run. For example, a simple logic and (&) operator needs different evaluation C language code sections (usually one procedure) for scalars (1 bit), narrow vectors (up to 32 or 64 bits), wide vectors and strength model bit vectors (one byte per bit required). In addition, for all but scalars, both unsigned and signed execution procedures are required. Sign extension for bit vectors that are not an integral number of words requires significant calculation. The Verilog standard requires that vectors at least up to one million bits wide simulate correctly. It seemed that taking the interpreter evaluation routines and converting them to high level instructions for a virtual machine that could then be interpreted was a good idea (see for example (Ertl and Gregg 2003)). Automatically generating a virtual Verilog machine from the simulator's interpreted evaluation code was tried (Ertl et al. 2002). The development was not too hard, but the resulting execution speed increased performance by only a small amount. Although the if statement overhead was removed, the extra overhead to decode and execute the interpreter virtual instructions nullified most of the gains.

5.1 Concrete algorithms but not organization from Morgan's optimizing compiler book

It was realized that the instruction level parallelism and branch prediction provided by modern microprocessors was required for fast simulation. Development of a full code generator began. It was next attempted to implement the concrete step by step method in Robert Morgan's book on building an optimizing compiler (Morgan 1998). The book describes the method used in building the very good Digital Equipment Corporation Alpha microprocessor compilers.

This simulator does not use the code generator organization from (Morgan 1998) (section 2.1, 21-26). Morgan advocates a basically breadth first filter down approach with a chain of transformations, each of which has a different data representation. Morgan writes: "Each phase has a simple interface" that can be tested in isolation. Also, "No component of the compiler can use information about how another component of the compiler is implemented" (p. 21).

Instead, modified versions of the very good algorithms spread throughout the Morgan book are implemented. During development of the code generator, new ideas were compared against the concrete Morgan book approach to make sure they were no worse. The Morgan book algorithms are especially useful because exceptions are discussed: for example, allowing non static single assignment (SSA) form lists that violate the rule that each variable is assigned to only once, or allowing define-use chains that sometimes do not have exactly one element (p. 142).

Morgan's idea that virtual instructions should be as close to machine instructions as possible is used (Figure 2.2, p. 24). The one exception is that Verilog requires very complicated, mostly boiler plate, prologue and epilogue instruction sequences that are just one virtual instruction.

Code generator steps use one master representation accessible from the interpreted data base. Both flow graphs and temp names (an unbounded number of virtual registers that in Verilog are often wide and require two bits to represent 4 values) are accessed in numerous ways. Basic blocks are accessible directly from the interpreter execution form that then points to flow graphs. Flow graphs, especially for net change propagation operators, are accessed from indexed tables and AVL trees (see igen.h in the simulator source). The machine code compiler adds information to the previous interpreted simulator data structures.

A compiler organization that combines all information into one master data base is best. Any code generation phase can use any of the information, accessed either directly from the interpreter execution net list, from indices (sometimes indexed tables and sometimes trees), from code generation basic blocks or from machine code routines. The idea is to continue to improve the global data base during code generator development and to continually add more information to all of the various parts of the one unified representation. For example, flow graph building algorithm insights were used to improve design elaboration data structures and interpreter execution data structures.

This unified data base, where each part is kept consistent with the other parts, is the key to simplicity and code quality. For Verilog, because there are so many different types of operations, from procedural RTL to declarative gate level to load and driver propagation, depth first code generation is better. Morgan's type of low level machine instructions (p. 24) are generated with some optimization by expanding constructs all the way down to something close to the final virtual instruction sequences when the flow graphs' virtual instruction sequences are built. Later mapping to machine instructions, except for X86 fixed registers, is straightforward. This approach may be Verilog specific because Verilog allows values to be read and written from anywhere in a design: cross module references are allowed and a programming language interface (PLI) can run in any Verilog thread. There is effectively no usable context information in Verilog.

For example, once some C code for the unified data structures of the flow graph and basic block mechanism were written, the Morgan book detailed algorithms on define-use lists and the importance of SSA (12.5.1 p. 291 and 7.1, p. 142) could be applied. The heuristic used is to generate (top down depth first) as many temporaries as possible. Then SSA problems are fixed when needed, or code is even allowed to violate SSA form in ways that are recognized by the machine code expander. The crucial data structure used for optimization is the define-use lists. Morgan suggests the idea of allowing non SSA instructions and temporaries (p. 142). Extra temporaries can then be eliminated during optimization passes through all flow graphs and virtual instruction lists.

5.2 Compiler file organization

Basic block creation and virtual instruction generation C code is in the v_bbgen C files. Define-use list and other flow graph elaboration code are in the v_bbopt.c file. The v_regasn.c file assigns machine registers to the unbounded number of temps. The v_cvcms.c and v_cvcrt.c files plus the v_asmlnk.c file contain support C procedures plus code to generate the GNU AS assembly output, run gas and link the final output executable called cvcsim. The v_aslib.c file contains wrapper C procedures that are called from the generated assembly but whose function is usually to call an interpreter execution procedure. By using wrappers, early versions of the compiler could compile almost all of Verilog, but simulation was not yet fast because execution used wrappers that executed the slow interpreter code.

6. What is Verilog

Verilog is a Pascal like language (Wirth 1975) for the description of electronic hardware. All variables are static because there are no implicit stacks in hardware. Verilog is a combination of normal behavioral programming with parallelism, execution of hardware described at the RTL level, and low level primitive declarative gates and flip flops. A common electronic design method codes circuit descriptions in RTL and then runs a program to synthesize the RTL into gates, also coded in the Verilog language. Verilog is used to simulate (predict behavior when an IC is fabricated) both the RTL and the synthesized gates with accurate timing. See (Thomas and Moorby 2002) for a description of the Verilog HDL. See (Sutherland et al. 2006), pp. 401-413 for a history of the Verilog HDL. See (Allen and Kennedy 2002), pp. 619-622 for a description of Verilog from the viewpoint of optimizing compiler development.

7. Why Verilog Simulations Run Slowly Compared to Computer Programs

First, Verilog RTL values require 2 bits (4 values) for every hardware bit. Gate level accurate delay simulation also requires bus values that allow 127 different values and driving strengths. A simple logic operation requires at least 3 or 4 instructions (not counting loads and stores). Second, hardware registers are almost always wider than the native von Neumann machine register width because hardware design involves modeling the next generation of electronics. This requires evaluating multiple machine words for each operation. Third, Verilog requires event driven simulation. When a delay or event control (@(clk3) say) is executed, the simulation must schedule a new event in an event queue and suspend the current execution thread to be restarted later. An even slower process occurs when a value is changed: all variables that reference the changed value in right hand side expressions driving a left hand side value must be evaluated and new assignments made. These affected left hand side variables are called loads. Also, when a wire is evaluated, it may be necessary to evaluate multiple drivers and determine which is the strongest. See (Meyer 1988) for a data structure that allows implementation of efficient load propagation and driver competition algorithms.

8. Performance

The simulator described here is arguably the fastest full accuracy 1364 Verilog simulator. Evidence here is anecdotal from users because the other commercial simulators are licensed with clauses that forbid public benchmarking. One anonymous published comparison of this simulator with other compiled commercial simulators for RTL but not gate level simulation speed (Anonymous 2011) showed 2 to 5 times faster simulation by this simulator for the two discussed designs.

Other anecdotal evidence is that optimization is so good that in one case full accuracy 1364 Verilog simulation is almost as fast as cycle based evaluation using a detail removing, cycle based, open source Verilog evaluator called Verilator for a Motorola 68k processor model. Such fast simulation is possible because the model really only requires 2 state, not 4 state, variables.

8.1 Comparison with Open Source Icarus Verilog

There is another open source Verilog simulator, Icarus Verilog (called iverilog), that implements most IEEE 1364 Verilog RTL features. It is difficult to use for benchmarking because it does not implement Verilog source macro cell library options (called -v and -y libraries). A sixty four bit version of the simulator described here is usually 35 to 45 times faster than the Icarus Verilog release called 10.0 stable made with the default options (-g -O), run on an Intel Core i5-5200U 2.2 GHz low power processor running Linux Centos 6.7.

For one SHA1 check sum circuit, this simulator is 111 times faster. The SHA1 model is almost just a Pascal program, so again the +2state option can be used. The two state option works by keeping the X and Z values (called B part words) around, but simulation only needs to initialize the words to zero once. If a design simulation really requires X and Z values, simulations run with this option will be incorrect. Flow graph optimization normally removes B part basic blocks from output machine code when X or Z values are impossible. The two state option allows generating simpler flow graphs so that evaluation of the values in the A (0 or 1) words can be optimized even more.

9. Simple Design Method for Complex Language Compilers

9.1 Develop an interpreter using language specific simple concrete methods

This simulator was developed at the same time the IEEE 1364 Verilog standard was being defined and continually changed. It is very difficult to simplify or shorten compiler development duration when a computer language is also under development. Some helpful ideas are:

9.2 Use simple organization

One centralized include file is used by every C source file, with defined and used prototypes at the top of each source file. This organization makes it easy to eliminate occurrences of more than one routine for the same basic function (wide vector sign extension, for example). Also, it is important to be able to make a change in only one place when Verilog changes or when better simulation algorithms are implemented. It is also usually not obvious, when a new Verilog feature is first introduced, what part of simulation is the same as some pre existing feature.

9.3 Use only one internal organization (net list data structure for Verilog)

The first pass scans and parses Verilog source into normal module lists, statement and gate lists, expression trees and linked symbol tables. Then the next set of fix up procedures is used to fill the same data structure with more information, which includes possibly totally changing the instance tree hierarchy. Then the next elaboration phase fills in more simulation preparation information and allocates variable and state memory just before beginning interpreted simulation. This organization allows easily moving processing steps forward and backward when the Verilog language changes. The compiler is run just before interpreted simulation would have been

executed. It outputs a binary that is then run for compiled simulation.

9.4 Avoid generators, grammars and tables

The simplest and most powerful scanning method uses a giant case statement. The simplest and most powerful parsing method uses language specific recursive descent, because normal push down stack based parser tables use very weak push down automata, even weaker than TMs. This is especially important in complicated languages such as Verilog, where the scanner needs parser information and the parser needs scanner information. The 1364 standard committee does not worry about constraining the language so that it can be coded as a context free grammar. Verilog assignments are separate from expressions, so simple and fast operator precedence parsing can be used ((Gries 1971) section 6.1, 122-132). This simulator can elaborate 5 million line designs in less than 15 or 20 seconds on a modern fast CPU. Fast elaboration speeds up development of code generator algorithms, allows faster compiler debugging and assists in finding faster machine code instruction patterns for the output machine language program.

9.5 For complex languages first develop an interpreter

The first step in fast compiled Verilog development requires developing a 1364 standard compliant interpreted simulator so that compiler machine code execution can be regressed against the interpreter standard. Once simulation is running, many speed improvements will become obvious (made visible) without needing to wait for a long design phase before experimental data is available.

9.6 Add interpreter wrapper capability

There are I_CALL_ASLPROC and I_CALL_ASLFUNC virtual instructions that work with the normal unbounded register temps and define-use predominator algorithms, i.e. their interface is no different from a low level virtual machine instruction. This allows running simulations with only some constructs compiled. During initial development, only narrow (less than 32 or 64 bit) variables were compiled; all wider expressions and assignments were evaluated with wrappers. Then, step by step, wrappers were replaced by low level machine instructions. Some very complicated Verilog algorithms, such as multiple path conditional delay selection and switch level simulation, are still just wrappers.

9.7 Make writing virtual instruction generation same as interpreter execution

For example, Verilog binary operator evaluation starts with a wrapper, then is replaced by a version of the same procedure with calls to gen_tn to create temporary registers and start_bblk to replace if statements (see the eval binary procedures in the v_ex2.c file near line 6600), for which the code generation procedure is in file v_bbgen.c. This allows code generation to mostly involve replacing C program evaluation statements with the corresponding temp register, basic block and virtual operation generation procedure calls. Code generation for more complicated simulation operations, such as scheduled event processing, can be simplified this way also.

9.8 Modify Morgan's dominator-based optimization algorithms

The book Building an Optimizing Compiler by Robert Morgan contains a concrete simplified method for optimizing flow graphs and define-use data structures. Start with that, simplify even more, and modify it to fit your complex language. This is the hard, very important part of developing a fast compiler. Once good flow graph optimizations are implemented, the register allocator becomes much easier and better. See optimize_1mod_flowgraphs at the beginning of file v_bbopt.c for the code that implements this.

9.9 Use low level virtual instructions close to machine instructions

Virtual instructions are defined and emitted that are mapped into machine instructions. It is best to generate low level virtual instructions because the human coding the generation for a particular Verilog feature can craft good instruction sequences. Verilog operations at minimum require operating on an A part (one or more computer RAM words) for the 0 and 1 values and a B part for the X and Z values, so even simple operations require multiple machine instructions and probably two by two vector cross product evaluation.

The copy operation is an exception, especially for Verilog, because much simulation is copying data that can be very wide. Initial code generation inserts a complicated virtual I_COPY instruction whenever there is a possibility of a need for a copy. Then, during mapping from the I_COPY virtual instruction to machine instructions, the copy usually can be removed without needing to emit anything.

9.10 Avoid extra representations such as tuples and breadth first transformations

One of the best simplifying and simulation speed increasing methods is depth first code generation. This method is intentionally the opposite of what (Morgan 1998), page 212 recommends. Morgan recommends breadth first code generation with successive transformations to lower levels. It is much simpler to use depth first generation of almost machine instructions because it allows the compiler writer to control machine code sequences and reduces the number of lines of compiler code.

9.11 Use experimentation to find fast CPU instruction parallelism code sequences

Once a working code generator was written and debugged, the best speed improvement method was to set up shell scripts that would run two different versions of the compiler on a speed test regression suite and compare results. The low level machine instruction sequence from the best result would then be used. The developers did not have access to the X86_64 multi-issue pipeline optimization rules documentation, so they needed to run experiments. Maybe the experimental approach is better in general because the rules are so complex. This was the most important simulation execution speed up idea. The other was compiling the change propagation and scheduled event processing code into flow graphs.

10. Allen and Kennedy Hardware Simulation as Abstraction Contradicted

In the book Optimizing Compilers for Modern Architectures, Randy Allen and Ken Kennedy argue that the task of optimization of hardware descriptions abstracts hardware descriptions to a less detailed level (Allen and Kennedy 2002). Allen and Kennedy write "Another way of saying this is that simulation speed is related to the level of abstraction of the simulated design more than anything else – the higher the level, the faster the simulation" (p. 624). For Verilog this claim is wrong. Low level exact Verilog and machine code detail modeling results in faster simulation because it allows maximum utilization of the low level instruction parallelism that is built into modern microprocessors. The optimizations include: processing as wide a bit vector chunk at a time as possible, outputting instruction sequences that keep instruction pre fetch queues and pipe lines full, and outputting instruction sequences that work well with microprocessor branch prediction algorithms. Basic blocks in

5 2018/1/16

Verilog are small but very numerous because separate A parts and B parts require conditionals.

This compiler development project shows that a not very complicated code generator (less than 100,000 lines of C code) can produce good code using the experimentation method of running speed regression test suites with different versions of to be emitted instruction sequences and choosing the fastest. The standard way (almost but not quite required by the IEEE 1364 standard) to store Verilog 4 value logic and wire values stores values as two separate machine word areas. The B part selects unknown values: X (unknown) and Z (high impedance). If the B part is zero, the A part selects 0 or 1 values. Bit and part select operations need to be optimized as fast load, shift and mask operations that use mostly boiler plate instruction sequences. Four value logic operations can also be generated as combinations of machine logic instructions combining the A and B parts. Complex logic operations, though, are sometimes better implemented using separated A and B parts or even as table look ups.

The remainder of this section criticizes specific ideas for implementing simulation abstraction from the Allen and Kennedy book.

10.1 Inlining modules, p. 624

Expanding modules inline is called a fundamental optimization. This is incorrect because it is much better to use the normal programming language optimization of not inlining but separating machine code from state and variable data that are accessed using per instance based addressing with traditional display technology (Gries 1971, pp. 172-175). Not only is code smaller, but separating state from model description makes many optimizations visible, especially ideas for pre-compiling event scheduling and processing.

10.2 HDL level execution ordering, p. 626

Allen and Kennedy argue for HDL statement reordering optimizations. Far more important is machine instruction reordering to maximize low level microprocessor instruction parallelism. Part of the reason for this is that although Verilog has fork-join constructs, they are rarely used. Parallelism in Verilog comes from a very large number of different always blocks usually triggered with an event control (for example always @(clk)).

10.3 Dynamic versus static scheduling, p. 627

This section discusses oblivious evaluation. Instead of the normal Verilog algorithm that "dynamically tracks changes and propagates those changes", the alternative blindly evaluates without any change propagation overhead. The oblivious method can not work for Verilog because it is common to have thousands of tick gaps between edges that have very high event rates. Static scheduling and compiling events after change recording into pre-compiled flow graphs to the extent possible is more important.

10.4 Fusing Always Blocks, p. 628

Here Allen and Kennedy are correct. Fusing always blocks makes a huge speed improvement.

10.5 Vectorizing Always Blocks, p. 632

The idea is that Verilog code generation optimization should "rederive the higher-level abstraction that was the original intent". The problem with this is that the decision to code scalarized or vectored is best left to the designer (HDL generation program). Vectorizing is not always better because if only a few bits change per clock cycle in a wide vector, it is better to simulate the scalarized individual bits. There is no way to determine bit percentage switching frequency for a given simulation a priori. Change detection of vector selects is fast because it only requires a mask, shift and xor then branch, but propagation of the changes can be expensive if the selected part select bits used in right hand side expressions do not exactly match. It is much better not to change the Verilog HDL code, but to find fast instruction sequences.

10.6 Two-State versus Four-State Logic, p. 637

Two state simulation is good when it is possible. The problem is that the reason for hardware simulation is to find mistakes that will result in unknown X (IC state that will be sometimes 1 and sometimes 0) states in the fabricated integrated circuit. Most real HDL designs do not allow two-state simulation, but when possible simulation is much faster because evaluations can directly use hardware machine instructions. Most four state evaluations are similar to evaluating two by two vector cross products. This simulator's low level design treats temporaries, even four state values that require an A part word memory area and a B part word memory area, as one temp. When an expression evaluation can be executed as two state, the B part instruction sequences are just not emitted, or for expressions containing non two state elements they are optimized away during flow graph simplification. Verilog variables can be declared as having a two state type.

10.7 Rewriting Block Conditions, p. 637

Allen and Kennedy advocate changing trigger conditions on Verilog blocks by abstracting and guessing user intent to eliminate the need to propagate change operators, which in Verilog are usually event controls on always blocks. Instead of trying to rewrite the Verilog source to a higher level, it is better to compile event queue propagation by generating flow graphs because most event controls are simple. Change operator occurrence scheduling and wake up event processing should also be compiled into flow graphs. State data should first be coded into the per instance (_idp) based display area (see procedure alloc_fill_ctevtab_idp_map_els near line 3820 of file v_prp.c). Then separate flow graphs are coded to schedule the changes and to propagate the changes. The generated flow graphs are optimized the normal way so most change operators require just a few instructions to schedule and a few instructions when the flow graph is jumped to from the event queue processing code. The flow graph generation procedure for the scheduling part for delay controls (@(clk) say) is in the gen_dce_schd_tev routine near line 10400 in file v_bbgen3.c.

11. GPU or Other Multi-Core Parallelism Does Not Work for Verilog

There have been many projects that attempt to use specialized computer hardware rather than optimizing flow graph compilers to speed up Verilog simulation. The reason for this is that Verilog simulations are a significant consumer of electronic company compute cycles, with sometimes entire server farms dedicated to running only Verilog. Verilog compiled executable speed is of crucial importance and has huge economic value. At least so far, efforts to speed up HDL simulation either with special purpose hardware or GPUs have failed. It is possible to build special purpose hardware that will run unoptimized Verilog 10 times faster. The problem is that a good flow graph based optimizing compiler run on a modern fast multi-issue microprocessor can be 10 times or more faster than the naive unoptimized computer software algorithms the special hardware is compared to.

Hardware emulation uses a different approach in which a design is converted to gate level and then fabricated into FPGAs ("layed out"). Emulated designs run much faster than software compiled simulation so they are good for early software development, but do not "simulate" a design in sufficient detail for hardware debugging.

The reason parallel Verilog simulation has not succeeded for most large system models is that Verilog HDL basic blocks tend to be very small with a high proportion of jumps and synchronizations. Small basic blocks parallelize efficiently on multi-issue CPUs but have too much synchronization overhead for coarser parallelism.

In the compiler described here, parallelism is used in one place. Value change dump file output (FST format is smallest and fastest) that is used by wave form viewers (software oscilloscopes) to debug hardware may be run on up to 2 additional cores to encode value changes. This use of parallelism can improve simulation times for simulations that generate huge value change dump files by a factor of two. Very large value change files are also written because they can be input into back end physical design tools that analyze the value changes to optimize IC power usage.

One area where parallel Verilog simulation has worked to some extent is providing tools that guess and advise users how to partition designs among multiple X86 64 cores. Simulation can then be run in parallel on multiple cores because the assignment to cores results in minimal communication overhead. Parallel Verilog simulation on GPUs has not succeeded.

12. Conclusions

The general area of how to represent digital electronic circuits is currently rather unsettled. Many designers would prefer to write programming language (C usually) code and have a program automatically convert the code into a hardware design that is then represented in Verilog (called high level synthesis).

However, if a design really can be coded as a computer program and implemented using FPGA integrated circuit technology, it may be cheaper and better in speed, cost and power to implement the function in software as a highly optimized compiled low level program on off the shelf multicore SoCs containing conventional CPUs, signal processing CPUs, GPUs and other types of CPUs. Another way of expressing this observation is that if digital circuits can be expressed using only two state logic with X and Z cross products ignored, implementation as computer programs may be better.

In my view there is a need for a simple Verilog like HDL that is intended to be generated by computer programs - maybe V– –. Verilog elaboration and use of constant parameters that are sometimes not really constant at run time make sense when HDL designs are coded by hand because they make Verilog easier to write, but not when Verilog is machine generated. If such a language eliminated non locality such as cross module references and force-release statements (originally intended to allow coding reset buttons), much faster simulation might be possible. Such simulation could use some type of graph theory connectivity compilation.

13. Acknowledgments

Andrew Vanvick (AIV in source) was the other CVC developer.

References

R. Allen and K. Kennedy. Optimizing Compilers for Modern Architectures. Academic Press, 2002.

Anonymous. A Verilog simulator comparison, September 2011. URL http://www.semiwiki.com/forum/content/777-verilog-simulator-comparison.html.

W. Aspray. John von Neumann and the Origins of Modern Computing. MIT Press, 1990.

S. Budiansky. Blackett's War: The Men Who Defeated the Nazi U-Boats and Brought Science to the Art of Warfare. Knopf, 2013.

K. Cooper, T. Harvey, and K. Kennedy. A simple, fast dominance algorithm. Technical Report TR-06-33870, Rice University Department of Computer Science, 2006.

A. Einstein. Geometry and experience, January 1921. URL http://www.relativitycalculator.com/pdfs/einstein_geometry_and_experience_1921.pdf. Lecture before Prussian Academy of Sciences, Berlin.

A. Ertl and D. Gregg. The structure and performance of efficient interpreters. Journal of Instruction-Level Parallelism, 5, 2003. Electronic archive journal.

A. Ertl, D. Gregg, A. Krall, and B. Paysan. vmgen – a generator of efficient virtual machine interpreters. Software Practice and Experience, 32(3):265–295, 2002.

P. Finsler. Über die Unabhängigkeit der Continuumshypothese. Dialectica, 23(1):67–78, 1969.

D. Gries. Compiler Construction for Digital Computers. John Wiley and Sons, 1971.

J. Hartmanis and J. Simon. On the structure of feasible computations. In M. Rubinoff and M. Yovits, editors, Advances in Computers, volume 14, pages 1–43. Academic Press, 1976.

IEEE. IEEE Std 1364-2005 Verilog hardware description language reference manual, 2005.

W. McKeeman, H. James, and D. Wortman. A Compiler Generator. Prentice Hall, 1971.

S. Meyer. A data structure for circuit net lists. In Proceedings 25th Design Automation Conference, pages 613–616. ACM/IEEE, June 1988.

S. Meyer. Adding methodological testing to Naur's anti-formalism. In IACAP 2013 Proceedings. International Association for Computing and Philosophy, July 2013.

S. Meyer. Philosophical solution to P=?NP: P is equal to NP. arXiv, cs.GL, March 2016. URL https://arxiv.org/abs/1603.06018.

R. Morgan. Building an Optimizing Compiler. Digital Press, 1998.

P. Naur. Computing as science. Appendix 2, pages 208–217. naur.com Publishing, 2005. URL www.Naur.com/Nauranat-ref.html of Oct. 2014.

P. Naur. Computing versus human thinking. Communications of the ACM, 50(1):85–94, 2007.

P. Naur. Conversations - Pluralism in Software Engineering. Lonely Scholar Publishing, Belgium, 2011. Edgar Daylight ed.

D. Ritchie. The development of the C language. In T. Bergin and R. Gibson, editors, History of Programming Languages II, pages 671–698. ACM, April 1993.

S. Sutherland, S. Davidmann, and P. Flake. SystemVerilog for Design. Springer, 2006.

D. Thomas and P. Moorby. The Verilog Hardware Description Language. Springer, 2002.

N. Wirth. The programming language Pascal. Acta Informatica, 1:35–63, 1975.
