Hardware Description Language Based on Message Passing and Implicit Pipelining
Total Page:16
File Type:pdf, Size:1020Kb
Hardware Description Language Based on Message Passing and Implicit Pipelining Dmitri Boulytchev Oleg Medvedev St.Petersburg State University St.Petersburg State University [email protected] [email protected], [email protected] Abstract depends on secret knowledge of implementation details We present a hardware description language (currently and subtle host platform properties. In the same time the called “HaSCoL”) which is based on both reliable and assistance of development tools for exploring the design unreliable message passing and implicit pipelining of space by source code transformations with predictable message handlers. The language consists of a small core results is extremely desirable. and a number of extensions, which cover many features Another inconvenience originates from the fact that of high level software languages as well as high level hardware and software are traditionally expressed us- hardware description languages (HDLs). These exten- ing different languages. Thus the problem of interfacing sions have simple projections into the core language and software and hardware components arises. At present allow compact and concise description of complex al- time no drastic solution for this problem is suggested. gorithms. The core language in turn can be converted We argue that these problems can be solved by de- into efficient VHDL. We discuss place-and-route results scribing both hardware and software using the same for some benchmarks implemented both in HaSCoL and general-purpose programming language (with hard- VHDL and suggest an optimization which should im- ware description as its proper sublanguage). We call prove the results significantly and make them close to this language “HaSCoL” (Hardware-Software Codesign those for hand-coded VHDL. Language). This paper presents our current state of the art: a pure hardware description language. 1. Introduction 2. HaSCoL in a nutshell The advances in hardware development technologies This section informally describes the language by exam- made the process of hardware design and implemen- ples. tation more ”software-like”. For example many algo- rithms can now be accelerated in orders of magnitude 2.1 Core features by partial mapping to FPGA (e.g. molecular dynamics simulation [6] or XML filtering [7]). These algorithms Any specification in HaSCol describes synchronous de- can be of various domains and therefore might be tough sign. Unlike VHDL or Verilog clock and reset (and some to implement using conventional hardware development auxiliary) signals are not presented explicitly but ex- techniques which were historically ad-hoc crafted to uti- press themselves by mean of relevant constructs seman- lize rather low-level tools for rather narrow set of prob- tics. lems. In particular, two general problems in the domain The example below demonstrates some features of of hardware-software codesign — Design Space Explo- the language using inline comments (which start with ration (DSE) and hardware-software decomposition — two dashes): can hardly be solved using conventional hardware de- −− − scription language (such as VHDL or Verilog) as the A 19 bit unsigned integer global variable with asynchronous −− read and synchronous write access; initial value 5 language of primary representation. −− is assigned on each reset. The main problem with those languages (let alone data count : uint(19) = 5; their low level) is that they can hardly be operated in −− A handler which starts on each cycle when messages arrive a formal, unmanned way. For example, hardware de- −− simultaneously from the channels ”chan1” and ”chan2”. velopment using VHDL involves a lot of pragmatics, in chan1(a : Type1), in chan2(b : Type2) f is directed by a number of programming patterns and −− In the first cycle of handler body we put the result of −− some computation (”a+b”) into intermediate register ”c” and −− Assembler syntax pattern: specified strings and −− increment the global counter in parallel. −− string representation of the parameters have to be c = a + b j count := count + 1; −− concatenated to produce assembler text for the −− In the second cycle we evaluate a condition (”c > 0”) −− instruction (e.g. ”r5 := r7”). −− and perform the first cycle of the ”else”−branch or looks f”r” dst ”:=” ”r” srcg −− ”then”−branch respectively. end if c > 0 then −− We send the value of register ”c” through the channel 2.2 Structural design and communication −− ”chan3”. If no handler is ready to consume the value −− the computation is suspended for one cycle. A typical design in HaSCoL consists of a hierarchy of send chan3(c) else blocks. Blocks are interconnected by channels, which −− We asynchronously compute expression ”not c” and store are used to transmit messages between handlers. −− the result in register ”d”. In parallel we send a message A single HaSCoL block is a set of handlers. Each −− to the ”notify” channel in an unreliable manner ( i . e . handler receives messages from some channels, pro- −− the computation continues regardless successful delivery). cesses received data and sends messages in a reliable d = not c j inform notify (); or non-reliable manner through other channels. −− We reliably send ”d” to the ”chan3” on the next cycle. Some of blocks may be written directly in VHDL send chan3(d) if one needs to instantiate some technology-specific fi VHDL components (e.g. a dual-port on-chip RAM). g A situation when two different messages are sent The example above demonstrates the two main fea- through the same channel in the same cycle is rather tures of HaSCoL — stallable pipelining and reliable common for complex designs. HaSCoL provides a con- message delivery. Pipelining means that a handler starts venient way to resolve such a conflicts by introducing to process a new incoming set of messages on each cy- the notion of ports, which can be specified for a chan- cle unless its first cycle is stalled. That is, if it takes more nel (in fact, if no ports are explicitly specified then one than one cycle to process a set of messages then many default port is assumed; strictly speaking a message can sets of messages can potentially be processed simulta- only be sent through some port of channel, not a chan- neously in a pipelined manner. nel itself). All ports of the same channel have different Reliable message delivery means that each “send” priorities assigned. Sending different values through the operator may stall a handler if no handler is ready to same port of the same channel at the same cycle results accept the message being sent. The stall is propagated in unpredictable clash of data, while sending through in opposite to the message flow direction so that no different ports does not. message is ever lost. This reliability mechanism may There are two priority assignment disciplines — completely block a handler. Since reliable delivery is static and fair. Static priorities are assigned once and not always desirable the language also has “inform” for all; fair priorities are reassigned in a circular manner construct, which never stalls but may loose messages. each time a message is consumed from the channel. The language provides many high-level constructs by mean of syntax extensions (see section 3). Below we 3. Language extension hierarchy demonstrate the most hardware specific one — proces- sor instruction description: An important HaSCoL feature, which facilitates various analyses and transformation into VHDL, is that HaS- −− The statement expresses a register−to−register assignment. CoL is not a single language but a hierarchy of lan- −− It matches an instruction code in ”cmd” with a binary pattern guages. Each hierarchy level is a syntax extension of the −− of the instruction and performs assignment. previous one. The base level is easily convertible into ef- match cmd with ficient VHDL. The hierarchy is shown on Fig. 1. −− The instruction has two parameters −−− numbers of general The core language level consists of constructs to ex- −− purpose registers of the processor. press only structural design, message delivery, straight- insn (src, dst : uint(3)) line multi-cycle code and one-cycle conditionals (no −− Instruction semantics. other control flow, no global variables etc.) This level does f regs[dst] := regs[src] g −− Binary pattern: specified bit strings and binary can be directly converted into synthesizable VHDL. −− representation of the parameters have to be The next level extends core language with global −− concatenated to produce a binary representation of variables. It is converted to the core language by rep- −− the instruction (e.g. 0x6F000000). resenting each global variable as a handler which holds coded f0b01 src dst 0x000000 g assigned value by permanently resending it to itself in Processor instruction set description erates synthesizable VHDL; assembler generator which 1 Coarse grain synchronization generates a binutils-compatible assembler from a pro- (monitors) cessor instruction set description. In this section we Control flow statements present evaluation of these tools on some benchmarks. (conditional, loop, First, we implemented a pipelined cubic polynomial pattern matching etc.) evaluator and a Fast Fourier Transform with a pipelined Global variables butterfly. Table 1 shows results of synthesis and place- and-route for a Virtex-4 xc4vlx15-12sf363 device. Im- Core language plementations written in HaSCoL and in pure VHDL are compared for each benchmark. For a case of the polynomial the difference in perfor- mance is rather minor. Analysis of mapping results for the FFT explains Figure 1. Language extension hierarchy the major difference in LUT numbers. The reason is an incompleteness of the VHDL generator. It generates Table 1. Evaluation results a huge amount of logic to support reliable message FFT polynomial passing even for the case, where only unreliable one VHDL HaSCoL VHDL HaSCoL is used (like in FFT). This defect can be fixed with an ∗ Lines of code 531 290 95 18 appropriate code analysis and we will do this soon. 4-input LUTs 566 2320 129 162 Note that the difference in description simplicity is RAMB16s 40 40 0 0 major for both cases.