Software Tools for Technology Transfer manuscript No. (will be inserted by the editor)

Utilizing Parametric Systems For Detection of Hazards

Luka´sˇ Charvat´ · Alesˇ Smrckaˇ · Toma´sˇ Vojnar

Received: date / Accepted: date

Abstract The current stress on having a rapid development description languages [14,25] are used increasingly during cycle for featuring pipeline-based execu- the design . Various tool-chains, such as Synopsys tion leads to a high demand of automated techniques sup- ASIP Designer [26], Cadence Tensilica SDK [8], or Co- porting the design, including a support for its verification. dasip Studio [15] can then take advantage of the availability We present an automated approach that combines static anal- of such descriptions and provide automatic ysis of data paths, SMT solving, and formal verification of generation of HDL designs, simulators, assemblers, disas- parametric systems in order to discover flaws caused by im- semblers, and compilers. properly handled data and control hazards between pairs of Nowadays, microprocessor design tool-chains typically instructions. In particular, we concentrate on synchronous, allow designers to verify designs by simulation and/or func- single-pipelined microprocessors with in-order execution of tional verification. Simulation is commonly used to obtain instructions. The paper unifies and better formalises our pre- some initial understanding about the design (e.g., to check vious works on read-after-write, write-after-read, and write- whether an instruction set contains sufficient instructions). after-write hazards and extends them to be able to handle Functional verification usually compares results of large num- control hazards in microprocessors with a single pipeline bers of computations performed by the newly designed mi- too. The approach has been implemented in a tool called croprocessor against a golden specification, which must be Hades, and we present promising experimental results ob- provided manually by the developers. tained using the tool on multiple pipelined microprocessors. However, even extensive functional verification can miss Keywords Microprocessor · Data Hazard · Control some non-trivial bugs. Therefore, usage of formal verifica- Hazard · Formal Methods · Parametric Systems tion becomes more and more desirable in the recent years. As opposed to testing and bug-hunting techniques that aim at detection of flaws, the goal of formal verification is to rig- 1 Introduction orously prove that the system is indeed correct. That is, if no issue is found by a formal method, the system is guaranteed As the complexity of hardware designs is growing over the to conform to the given specification. last decades and microprocessor development cycles are fac- Formal verification is, however, a very demanding task, ing pressure on fast and low design costs, automation of and even though a lot of progress has been achieved in this hardware development has become a crucial need. This trend area, formal verification is far from being able to fully auto- was further boosted by the recent rise of the so-called In- matically check all relevant properties of complex designs. ternet of Things (IoT) interconnecting embedded devices, Typically, a lot of human intervention is needed in extracting which often require specialised, power-efficient designs to parts of the designs to be verified, constructing abstractions be used. To facilitate the automation, specialized of their environment, and/or helping the automated verifica- L. Charvat´ · A. Smrckaˇ · T. Vojnar tion tool in the verification itself. The latter help can come, Brno University of Technology, Faculty of Information Technology, e.g., in the form of leading a proof when using interactive Bozetˇ echovaˇ 2, 612 66 Brno, Czech Republic theorem provers, choosing and fine-tuning the abstraction E-mail: {icharvat,smrcka,vojnar}@fit.vutbr.cz and verification methods used by model checkers or static 2 L. Charvat´ · A. Smrckaˇ · T. Vojnar analysers, and/or checking possible false alarms produced tion 8 presents an experimental evaluation of the proposed by such tools. approach. Finally, Section 9 concludes the paper. One way how to mitigate the complexity of automated formal verification is to develop methods specialised in ver- 2 Related Work ifying specific features of hardware designs, which is the ap- proach that we take in this work too. The main idea is that, Showing the absence of pipeline hazards is a native part this way, a high degree of automation and scalability can be of checking conformance between a register transfer level achieved since only parts of a design related to a specific (RTL) design and a formally encoded instruction accurate error are to be investigated. (ISA) description. The perhaps most cited approach to such We, in particular, concentrate on checking the absence checking is the so-called flushing technique [7], which has pipeline-based execution of certain kinds of errors in of in- been extended, e.g., in [18,27,16], to handle rather compli- structions in custom-built microprocessors. To be more pre- cated designs with multi-cycle execution units, exceptions, cise, we restrict ourselves to synchronous microprocessors and branch prediction. The main challenge of these works with in-order execution of instructions that do not contain is to overcome the semantic gap between the different levels analog components. Our focus on this area is motivated by of a processor description. Dealing with this issue typically that implementation of pipeline-based execution of instruc- requires a significant user intervention in the form of provid- tions in purpose-specific microprocessors with customized ing various additional assertions about the design (as can be data and control paths is a rather error-prone task, which seen, e.g., in the recent technical report [6]) or transforming implies a special need of proper verification of the resulting it to a purpose-specific description language. designs. In [17], the so-called self-consistency check that com- As our first step in the above area, we proposed a fully- pares possible executions of each instruction in two scenar- automated approach for checking whether pipelined micro- ios is introduced. The comparison is made wrt a property processors execute each instruction correctly provided that given by the user, e.g., a property concerning data hazards it is executing in isolation [9]. Later, we complemented the which deals with (i) executions of an instruction enclosed by approach by proposing a method for verifying the absence of any (random) instructions within the pipeline and (ii) execu- read-after-write (RAW) [11] and, subsequently, write-after- tions of the same instruction surrounded by NOP instructions read (WAR) and write-after-write (WAW) hazards [12] in only. If the self-consistency check succeeds, conformance of microprocessors with a single pipeline. In this paper, we the RTL and ISA descriptions of a processor can be estab- present a unified and better formalised approach of treat- lished by separately showing conformance of the RTL/ISA ment of all of the above mentioned hazard types. Moreover, descriptions of each individual instruction. The main draw- we extend the approach so that it can now address problems back of the approach is that it requires the enclosing instruc- that might be caused by control hazards as well. tions from the first run not to violate a so-called consistent Our approach starts by a static analysis of data paths state of the microprocessor, which has to be manually de- to detect anomalies and possible hazards. This analysis is fined by the user. followed by a transformation of detected problematic data In [1], a formal model based on a notion of stages, parcels paths to a parametric system: in particular, a system of a para- (instructions), and hazards has been introduced. Once the metric (and hence not known in advance) number of instruc- user defines predicates needed for describing the pipeline, tions to be executed over the detected path, which can be the design can be automatically formally proven correct un- viewed as a linear array of concurrent processes. Finally, der a correctness criterion given in the work. Another, a bit we verify whether the detected potential hazards are real by similar approach has been proposed in [19]. The approach means of techniques for formal verification of parametric introduces an abstract formal model whose components are systems. We have implemented our approach in a tool called to be linked by the user with the concrete cycle-accurate im- Hades [10] and present encouraging results from its exper- plementation through a number of mappings. Afterwards, imental evaluations on multiple pipelined microprocessors, interval property checking [23] is used to check several prop- including real-life ones. erties implying correctness of the pipeline behaviour. Again, both of the above methods require a significant manual user Plan of the Paper Section 2 presents an overview of the re- intervention. lated work addressing validation of single-pipeline micro- Compared with the above approaches, we do not aim processors. Section 3 defines the needed notions. In Sec- at full conformance checking between RTL and ISA im- tion 4, we sketch the main idea of the proposed approach. plementations. Instead, we address one specific property— Sections 5 and 6 discuss pre-processing tasks that are needed namely, the absence of problems caused by data and con- before the core steps of our verification approach are ap- trol pipeline hazards. On the other hand, our approach is al- plied. These core steps are then described in Section 7. Sec- most fully automated—the only step required from the user Utilizing Parametric Systems For Detection of Pipeline Hazards 3

i is to identify the architectural resources (such as registers – There are arbitrarily many outbound data-out edges eq ∈ i and memory ports) and the program . E such that s(eq) = (vr, q) where 0 ≤ i < n for some n ∈ N. (For n = 0, we get a register that is only written.) – There is exactly one inbound clear edge erst ∈ E, also 3 Preliminaries denoted as the synchronous reset edge, s.t. t(erst) = (vr, rst). We now introduce various basic notions that we will build – There is exactly one inbound enable edge een ∈ E s.t. on in the rest of the paper. t(een) = (vr, en). Next, for each , the following criteria must be satisfied. 3.1 Processor Structure Graphs – For each circuit vg ∈ Vg implementing a Boolean func- In what follows, we expect a processor to be described in the tion g(a0, . . . , an−1), there is exactly one inbound edge form of a so-called processor structure graph (PSG) which for each argument of g such that t(eai ) = (vg, ai) for all can be represented by a tuple G = (V, E, s, t, ω). Here, V 0 ≤ i < n where n ∈ N. (For n = 0, we get a constant is a finite set that is the union Vr ∪ Vf of a set Vr of reg- function without parameters.) isters and a set Vf of Boolean circuits, Vr ∩ Vf = ∅. We – Every vmx ∈ Vmx that implements a case distinguish two types of registers: namely, architectural reg- selection function (sel, case0, . . . , casen−1) has isters Va and pipeline registers Vp such that Vr = Va ∪ Vp exactly one inbound edge for each of its arguments such and V ∩ V = ∅. The set of architectural registers V must a p a that t(esel) = (vmx, sel) and t(ecasei ) = (vmx, ci) for always contain exactly one vpc. We expect all 0 ≤ i < n where n ≥ 2. all registers to have a unit write and zero read delay. Longer – For each circuit vf ∈ Vf , there are arbitrarily many out- i i access times (e.g., for memory ports) can be modelled by bound result edges eq ∈ E such that s(eq) = (vf , q) introducing sequentially connected registers emulating the where 0 ≤ i < n for some n ∈ N+. required delay. Boolean circuits represent common combi- – There is no cycle in the graph consisting of vertices rep- national logic circuits. For the rest of the paper, it is suffi- resenting Boolean circuits only. cient to distinguish these circuits into Vmx and Finally, there are no other types of edges other than the ones all other circuits Vg, referred to as generic circuits further on. described above. Hence, we let Vf = Vmx ∪Vg while requiring Vmx ∩Vg = ∅. Due to the above restriction to at most one inbound edge For registers, we use a well-known notation to character- for a single connection type, one can use a simpler nota- ize their connections: namely, we use d to denote the data-in, tion to uniquely describe the edges. In particular, an edge q data-out, rst reset, and en write-enable connections. For e ∈ E that satisfies t(e) = (v, c), v ∈ V , c ∈ , can multiplexers, we denote by sel the inbound connection that T be encoded using the expression v.c. Finally, the function is the selector which selects one of the input cases ci to be ω : E → + represents a mapping that assigns some bit- transferred from the input to the output of the multiplexer, N width to all edges of the PSG. The mapping can be naturally which is again denoted as q. We denote input connections expanded to be defined over registers too—namely, we let of generic Boolean circuits as generic inputs ai. As before, ω(v ) = ω(v .d) for all v ∈ V . Additionally, it must also the output is denoted as q. Together, this gives us the set r r r r hold that ω(eout ) = ω(vr.d) for any (vr, eout ) ∈ Vr × {e ∈ T = {d, q, rst, en, sel} ∪ {ai, ci | i ∈ N} of all connec- E | s(e) = (v , q)}. tion types. r Since we propose the notion of PSGs to be as simple as Next, we use E to denote a finite set of transfer edges. possible, it does not take into account memories and mem- Note that we do not define the set of edges as E ⊆ V × V ory ports. Instead, it contains architectural registers, which since we sometimes need more edges between two nodes. can be used to represent particular memory cells. In the pa- Instead, we simply require that E is a finite set of some ab- per, we assume that a memory is modelled using a finite stract edges, and we assign each edge with its source, tar- number of architectural registers representing the cells of get, and type. Namely, we use s : E → V × T to assign the memory. Memory ports are then modelled using addi- to each edge its source vertex and its connection type, and tional logic circuits that select the appropriate memory cell t : E → V × T to assign to each edge its target vertex and using its address. In particular, for a memory with n address- its type of connection. able units, there are architectural registers m , . . . , m ∈ The sets V and E and the functions s and t must fulfil 0 n−1 Va. A read memory port of such a memory is modelled us- the following criteria. For each register vr ∈ Vr: ing a single multiplexer circuit vread ∈ Vmx connected to – There is exactly one inbound data-in edge ed ∈ E such each of the registers representing memory units—for each that t(ed) = (vr, d). mi, 0 ≤ i < n, there is an edge e = vread.ci connect- 4 L. Charvat´ · A. Smrckaˇ · T. Vojnar ing a multiplexer case with the corresponding memory unit Since our approach builds on analysing conditions that s(e) = (mi, q). The selector edge vread.sel then repre- hold in certain stages of the execution of a given instruc- sents a memory address and vread.q represents the data- tion, we now introduce a notion of edge conditions. An edge out connection of the memory port. A write memory port condition is a pair (e, b), denoted e b, meaning that the is modelled by n circuits used to enable writing to a given edge e ∈ E transfers some value b ∈ Bω(e). By E, we de- memory-cell mi, 0 ≤ i < n. Each of these circuits imple- note the set of all such edge conditions. Further, we define ments a Boolean function (sel = i) ∧ en, 0 ≤ i < n, where a mapping γ : E → 2C that assigns each edge condition sel represents a memory port address and en enables writing (e b) ∈ E the set of configurations from C in which the to the memory. A schematic of a write and a read memory edge e transfers the value b, i.e., γ(e b) := {c ∈ C | port is depicted in Fig. 1. c[e] = b}. Given a set K ⊆ E, we also use the point-wise T extension γ(K) := k∈K γ(k) of γ. 3.2 The Transition System Induced by a PSG

Let B = {0, 1} be the set of Boolean values, and let Bn 3.3 Parametric Systems denote the set of bit-vectors of size n ≥ 1. A PSG G = (V, E, s, t, ω) induces a (finite) transition system (C ,,→) When checking presence of hazards in a processor repre- whose set of states C = N ω(v) is the set of con- sented by a PSG, our approach derives as an intermediate v∈Vr B figurations of the PSG G and whose transition relation ,→ ⊆ model a parametric network of processes operating on a lin- 1 C ×C is defined later in this section. We use c[vr] to denote ear topology, which we denote as a parametric system in the bit-vector value of the register vr ∈ Vr in a configuration what follows. In particular, we use a common notion of para- c ∈ C . We abuse the notation and write c[e] to denote the metric systems where processes (which, in our case, repre- value transferred over an edge e ∈ E in the configuration c sent instructions being executed) may perform local transi- as well. Given an edge e ∈ E such that s(e) = (vf , q) where tions as well as universally or existentially guarded global vf ∈ Vf is a circuit computing a function fn(a0, . . . , an−1), transitions [13,22,2]. Formally, a parametric system (PS) n ∈ N, the value of c[e] can be recursively expressed as is a pair P = (Q, ∆) where Q is a finite set of states of a process and ∆ is a set of transition rules over Q.A global c[e] = fn(c[ea0 ], . . . , c[ean−1 ]) where eai ∈ E, 0 ≤ i < n, 0 corresponds to the edge of the i-th parameter of the function transition rule is of the form Q◦ : G |= q → q where fn. In the case that an edge e ∈ E is an outbound edge of Q ∈ {∀, ∃}, ◦ ∈ {← “left”, → “right”, ↔ “left or right”}, 0 a register vr ∈ Vr, i.e., s(e) = (vr, q), we let c[e] = c[vr]. G ⊆ Q, and q, q ∈ Q.A local transition rule is then of the 0 0 For each register vr ∈ Vr of a bit-width m, m ≥ 1, we form q → q where q, q ∈ Q. next-state function f next : (2·m+2) → A PS induces an infinite transition system whose config- assume the standard vr B m B where the register vr is written a value transferred over urations are finite non-empty words over Q, i.e., elements + the vr.d edge iff the vr.rst edge transfers “0” and vr.en of the set Q := {q1 . . . qn | 1 ≤ i ≤ n ∧ qi ∈ Q}. For transfers “1” in the given configuration. Next, the value of any index 1 ≤ i ≤ n, the i-th process can change a configu- 0 the register vr is nullified if the vr.rst edge transfers “1”. ration q1 . . . qi . . . qn to a configuration q1 . . . qi . . . qn when 0 In the following, we will refer to such a transition as regis- it goes from its state qi to qi using some of the transition ter clearing. Finally, the register vr keeps the same value if rules. The rule can be applied only if its guard is satisfied. both vr.en and vr.rst transfer the value of “0”. This will For example, the meaning of the guard ∃↔ : G is “for each be referred as register stalling in the following explanation. state q from the set G, there should be at least one process f next in the linear topology including the current one so that the When put together, the next state function vr can be for- mally defined as follows: process is in the state q”. Formally, the guard ∃↔ : G is sat- isfied in the configuration q1 . . . qi . . . qn by the i-th process curr en = 0 ∧ rst = 0,  iff ∀q ∈ G ∃1 ≤ j ≤ n : qj = q. Similarly, the mean- f next (curr, new, en, rst) := new en = 1 ∧ rst = 0, vr ing of the guard ∃← : G is “for each state q from the set G,  0 otherwise. there should be at least one process to the left of the current

0 0 one so that the process is in the state q”. Formally, the guard Then, the relation ,→ contains a transition c ,→ c iff c [vr] = next ∃← : G is satisfied in the configuration q1 . . . qi . . . qn by the fv (c[vr], c[vr.d], c[vr.en], c[vr.rst]) for all vr ∈ Vr. r i-th process iff ∀q ∈ G ∃1 ≤ j < i : qj = q. The meaning 1 Note that we do not introduce any notion of initial states of the of the other guards is defined analogically. transition system. This is because we will use the notion of transition In the following, we assume working with PSs equipped systems to help us explore which sequences of transitions of the system with a fairness assumption requiring that every process must (C,,→) are possible while restricting the exploration to relevant states externally to the notion of transition systems (as we will later see in perform a step during a transition between two configura- Section 7.2). tions. Intuitively, this corresponds to the fact that all the Utilizing Parametric Systems For Detection of Pipeline Hazards 5

.en .d

Write Port .sel Read Port

.a0 Cmp0 0 .d .c .c .q 0 1 m0 .en .q 0 MxSel

.c0 ...... MxMem .sel ... .c .sel .a n-1 0 .d

Cmpn-1 n-1 .c .q 1 mn-1 .en

0 MxSel .c0 .sel

Fig. 1: A schematic of a write and a read memory port. instructions represented by processes must make a (possi- types of the hazards and on CPU designs that do not use bly idling) step in their execution. Moreover, we expect that out-of-order execution. We will now give informal defini- global transition rules have a higher priority than the local tions of each of the considered hazard types, which we will ones. later formalize in Section 6. We will reduce the problem of checking existence of a hazard in a given PSG to a reachability problem given by Definition 1 A read-after-write (RAW) data hazard is a sce- a PS P , a regular set I ⊆ Q + of initial configurations, and nario in which a later-started instruction uses data supposed a regular set Bad ⊆ Q + of bad configurations. In particu- to be produced by an earlier-started instruction, but the ear- lar, we will define Bad as the upward closure of a finite set lier instruction has not yet managed to proceed far enough B ⊆ Q + of minimal bad configurations where some of the in the pipeline to write the data into the register used by the instructions being executed got to an undesirable state. This later instruction. The later instruction then stores a poten- is, Bad = {c ∈ Q + | ∃b ∈ B : b v c} where v is the tially wrong result of its execution, obtained by dealing with obsolete data, into some programmer-visible register. usual sub-word relation (i.e., u v s1...sn ⇔ u = si1 ...sik for some 1 ≤ i1 ≤ ... ≤ ik ≤ n, 1 ≤ k ≤ n). Now, let R ⊆ Q + denote the set of all reachable configurations of Definition 2 A write-after-read (WAR) data hazard is a sce- the given PS P , i.e., the set of all those configurations that nario in which some data that should be used by an earlier- can be reached from some initial configuration from the set started instruction are overwritten by a later-started instruc- I by applying a finite sequence of the transition rules such tion before the earlier instruction manages to read the data. that the fairness assumptions are respected. We say that the The earlier instruction then stores a potentially wrong result PS P is safe wrt I and Bad iff no bad configuration is reach- of its execution, obtained by dealing with data seemingly able, i.e., R ∩ Bad = ∅. coming from the future, into some programmer-visible reg- ister.

3.4 Data and Control Hazards Definition 3 A write-after-write (WAW) data hazard is a sce- nario in which an earlier-started instruction overwrites the Hazards in the instruction pipeline of central processing units result of a later-started instruction that is stored in some (CPUs) are problems caused by inadequate synchronisation programmer-visible register, which then ends up containing of earlier and later instructions running concurrently through obsolete data. the pipeline that may cause potential corruption of the data used by the instructions, with some result of the compu- Definition 4 A control (CTL) hazard is a scenario where an tation that referred to such data eventually propagated to earlier-started control-flow instruction changes the flow of a programmer-visible register [24]. Three common types of the control, but some later, speculatively-started instruction hazards are data hazards, control hazards, and structural haz- manages to store some data into a programmer-visible reg- ards. In this article, we will further focus on the first two ister. 6 L. Charvat´ · A. Smrckaˇ · T. Vojnar

In commonly used in-order execution designs, the above and load/store instructions. In order to keep the PSG eas- specified hazards are eliminated by pipeline stalling and/or ily readable, types of connections are shown for architec- operand forwarding. For pipeline stalling, it is necessary for tural registers and case-c edges of multiplexers only. Also, a processor to be equipped with a control logic that deter- since enable (i.e., “en”) and clear (i.e., “rst”) connections mines whether a hazard could/will occur. If such a situation for pipeline registers4 are common for each stage, they are is detected, the control logic inserts no-operation (NOP) in- left out up to the ones that are required in the further expla- struction, sometimes called bubble, into the pipeline. There- nation. fore, before the later instruction from the pair of instructions In the CPU, the computation starts in Stage 1 by using which would cause the hazard executes, the earlier one will the content of the program counter PC to address the ith cell have sufficient time to proceed far enough in the pipeline so of the program memory Progi. An instruction fetched from that the hazard does not happen. the program memory cell is stored into the register IdIr that In the case of operand forwarding, additional (redun- represents the so-called fetch register. The fetched instruc- dant) data-paths are introduced into the . tion word in IdIr is then decoded by an instruction decoder These data-paths are aimed to provide an option to prop- in Stage 2. Boolean circuits that belong to the decoder are agate partially computed data2 from an earlier instruction shown in yellow. Next, an address stored in the index regis- to a later one in order to minimize the number of NOP in- ter is used to fetch data from the jth cell of the data memory structions that would otherwise have to be inserted using the Memj in Stage 3. Optionally, the can be auto- above mentioned stalling technique. incremented. The auto-incrementation logic is a feature al- lowing for an early incrementation of the value of a register for memory addressing just before or right after it is read. 4 The Proposed Approach to Hazard Detection We then speak about the so-called pre-/post-increment, re- spectively. The auto-incrementation feature usually brings Our approach for verifying that the pipeline logic prevents a more efficient execution of sequences of instructions that hazards consists of the following steps: (i) a simple data- access the processor’s memory (for instance, when comput- flow analysis intended to distinguish particular stages of the ing over long arrays or other juxtaposed data). This speed- pipeline, (ii) a consistency check to make sure that the flow up results from removing the need of otherwise required logic guarantees an in-order execution of instructions through pipeline stalls. However, the feature also introduces poten- the identified pipeline stages, (iii) a static analysis deriving tial WAW and WAR hazards that must be handled properly. constraints over data-paths of instructions that can poten- Finally, in Stage 4, the decoded opcode part of the instruc- tially cause a pipeline hazard, (iv) generation of a parametric tion is used to determine the type of an ALU operation (with system modelling mutual interactions between potentially the ALU itself colored in purple) and to select destination conflicting instructions allowed by the derived constraints, registers by setting their enable connection “en” to logical and (v) an analysis of the constructed parametric system to “1”. see whether the identified interactions may lead to a hazard. The Boolean circuit Flow in Fig. 2 represents the flow We assume the processor under verification to be rep- logic of the second pipeline stage. This logic is responsible resented using a PSG, which can be easily obtained from for dealing with WAR hazards on the index register X. The a description of the processor on the register transfer level flow logic implements the function (RTL) written in common hardware description languages, such as VHDL or Verilog. Flow(IncX , OfWrMem) := ¬IncX ∨ ¬OfWrMem. Example 1. Throughout the following sections, we will be illustrating the different steps of our approach on a running In case a later instruction wants to perform an auto-increment example depicted in Fig. 2. The figure shows a PSG de- of the index register X while an earlier instruction is going scribing a part of a simple microprocessor with an accu- to use the content of X for a memory write, the flow logic mulator architecture with the following architectural regis- uses the enable “en” and clear “rst” signals of pipeline reg- ters: X (a memory index register), A (an accumulator), PC isters to insert a pipeline bubble between the instructions (the program counter), Progi (program memory cells), and into Stage 3. / Memj (data memory cells) where 0 ≤ i ≤ `, 0 ≤ j ≤ k and k, resp. `, are the sizes of the memories3. The de- picted part of the CPU is used when executing arithmetic 5 Preprocessing a Processor Structure Graph 2 The data that have not been written to its final register. 3 We assume that the Impl, Or, and Not vertices of the PSG com- This section describes the first two steps of the proposed ap- 2 2 pute the standard implication fimpl : B → B, disjunction for : B → proach: namely, the data-flow analysis identifying pipeline B, and negation fnot : B → B functions. That is, for instance, 4 fimpl (a0, a1) := a0 ⇒ a1 for a0, a1 ∈ B. For a full list of pipeline registers, see Table 1 in Section 5.1. Utilizing Parametric Systems For Detection of Pipeline Hazards 7

Stage 1

Or' .c .c 1 0 .en .ci MxPC IncPC PC MxProg Progi .sel .sel .d Stage 2 IdIr En .a0 IncX Flow

.a1 .a1

.a0 Jmp WrA WrX Alu Op Inc Impl WrMem .a0

.a0 .c0 .c1 .a1 MxInc Or ... Not' .sel

Stage 3 .d .en Rst OfJmp OfWrA OfWrX OfAlu OfOp X OfWrMem

.cj MxMem .sel Stage 4

.a0 ExJmp ExWrA ExWrX ExAlu ExOp ExMem Cmpj Zero ExWrMem

.c1 .c0 .sel .a1 .sel Eq MxSelj MxJmp .c0 .c0 .sel .sel .en .a .c 0 .c 1 1 .en Legend: .c0 .c1 A

.d .c MxOp .c MxAlu 2 2 .a1 ... Memj Arch. Storage Zero' Sign Add .d ... Decoder .a0 ALU Stage 5

Fig. 2: A processor structure graph of a part of a CPU with an accumulator architecture. stages and the pipeline consistency check ensuring a proper computed stages forward through the PSG. If several stage in-order execution of instructions within the pipeline. values are propagated to a single vertex or edge, the min- imum is taken. Whenever a propagated stage value passes a register, it is incremented by one. If there is a register with

5.1 Data-Flow Analysis Discovering Pipeline Stages no path originating from the program counter, such as Progi in Fig. 2, then its stage is derived from the lowest stage that The input of the proposed verification method consists of reads from that register. For instance, Progi is only read by a PSG and a list of its architectural registers, including the IdIr and so Progi will be placed to the stage preceding the program counter vpc. On this input, the method starts by one where IdIr is located. The analysis gives us a mapping a simple data flow analysis whose goal is to compute the ϕ: V ∪E → S, S = {1, . . . , n}, n ≥ 1, which maps vertices number of pipeline stages. We then map registers, logic func- and edges of the PSG to pipeline stages. tions, and edges of the PSG into the pipeline stages. We define a pipeline stage as the sub-graph of the PSG that is responsible for executing a single-cycle step of an instruc- Subsequently, we derive the so-called write stage map- tion. The pipeline stage that an edge or a vertex (represent- ping ϕwr : V ∪ E → 2S that maps each vertex or edge to ing a register or circuit) of a PSG belongs to is given by the the set of stages that directly influence its value. Namely, minimum number of cycles needed to propagate data from we include into ϕwr(x) the stage of every pipeline register the input of the program counter to the edge or the output vp ∈ Vp from which there is a path to x that does not pass of the given vertex, respectively. Hence, as a particular case, through any further register from Vp. Likewise, we derive the program counter itself belongs to Stage 1. the read stage mapping ϕrd : V ∪ E → 2S for each vertex or The data-flow analysis that we use starts from the pro- edge that describes which stages are directly influenced by rd gram counter vpc and its Stage 1 and propagates the so-far its value. In particular, we include into ϕ (x) the stage of 8 L. Charvat´ · A. Smrckaˇ · T. Vojnar

Table 1: Registers of the CPU from Fig. 2 and the corre- out some extreme ways of pipeline implementation allowed sponding pipeline stages. by the original rules. An example of such a situation is an optimization of the execution during stage stalling when an Register Stage Write stages Read stages Pivot instruction preceded by a series of NOP instructions is al- ϕ ϕwr ϕrd lowed to proceed to the next stage in order to increase the throughput. PC 1 {1, 2, 3, 4}{1, 2} – For the following, assume a transition system (C ,,→) Prog 1 ∅ {2} i – induced by the PSG being verified. We introduce mappings X 3 {2, 3, 4}{3, 4, 5} – C st, rst : Vp → 2 defined as A 5 {4}{1, 2, 3, 4, 5} – st(vp) := γ({vp.en 0, vp.rst 0}), Memj 5 {4}{4} – rst(vp) := γ(vp.rst 1). IdIr 2 {1, 2, 3, 4}{1, 2, 3} X OfJmp 3 {2, 3, 4}{1, 2, 3, 4} X Intuitively, for any register vp ∈ Vp, st(vp) and rst(vp) are OfWrA 3 {2, 3, 4}{4} × the sets of configurations in which vp is stalled or cleared, respectively. The pipeline consistency rules that we check OfWrX 3 {2, 3, 4}{1, 2, 3, 4} X are then the following: OfAlu 3 {2, 3, 4}{1, 2, 3, 4} X OfOp 3 {2, 3, 4}{1, 2, 3, 4} X – Rule 1: If some pipeline register of a stage s ∈ S is OfWrMem 3 {2, 3, 4}{1, 2, 3, 4} stalled, then all pipeline registers of the Stage s have to X 0 be stalled, i.e., for all vp, vp ∈ Vp: ExJmp 4 {3, 4}{1, 2, 3, 4} X 0 0 ExWrA 4 {3, 4}{5} × ϕ(vp) = ϕ(vp) ⇒ st(vp) ⊆ st(vp). ExWrX 4 {3, 4}{1, 2, 3} X The rule follows the idea that an instruction carried by ExAlu 4 {3, 4}{1, 2, 3, 5} X a pipeline stage cannot be fragmented. The rule also re- ExOp 4 {3, 4}{1, 2, 3, 5} X flects one of the fundamental assumptions about pipe- ExWrMem 4 {3, 4}{1, 2, 3, 5} X lined execution from [20]: namely, at any given time, an ExMem 4 {3, 4}{3, 5} X instruction is always in a single pipeline stage only. As 0 a corollary, by simply swapping vp and vp, one can de- 0 rive a stronger statement ϕ(vp) = ϕ(vp) ⇒ st(vp) = v ∈ V 0 every pipeline register p p to which there is a path from st(vp). x that does not pass through any other register from Vp. – Rule 2: If some pipeline register in a Stage s ∈ S \ Pipeline stages of the registers from the PSG of Fig. 2 {max(S)} is stalled, then all pipeline registers of the and the corresponding read and write stages, computed as Stage s + 1 have to be stalled or cleared, i.e., for all 0 described above, are shown in Table 1. (The notion of pivots vp, vp ∈ Vp: will be introduced later on.) 0 0 0 ϕ(vp) = ϕ(vp) − 1 ⇒ st(vp) ⊆ st(vp) ∪ rst(vp). This rule is a rephrased version of Equation (15) from [20] 5.2 Pipeline Consistency Checking and prevents duplication of an instruction. The second step of our approach is consistency checking – Rule 3: If some pipeline register in a Stage s ∈ S\{1} is which checks whether the flow logic assures a correct in- stalled, then all pipeline registers of the Stage s − 1 have 0 order execution of all instructions through all the identified to be stalled, i.e., for all vp, vp ∈ Vp: pipeline stages. This means that all instructions which are 0 0 ϕ(vp) = ϕ(vp) + 1 ⇒ st(vp) ⊆ st(vp). fetched from the program memory should flow from the first stage to the last stage while maintaining their execution or- This rule is a rephrased version of Equation 16 from [20] der with no loss or duplication of an instruction. To check and prevents an instruction to be lost. the above, we verify whether the flow logic obeys a set of – Rule 4: If some pipeline register in a Stage s ∈ S is rules which expresses how the control connections (en, rst) cleared, then all pipeline registers of the Stage s have to 0 of registers in adjacent pipeline stages should be set. In par- be cleared, i.e., for all vp, vp ∈ Vp: ticular, we use a strengthened variant of the rules proposed 0 0 ϕ(vp) = ϕ(v ) ⇒ rst(vp) ⊆ rst(v ). in [20]. The rules have been strengthened since (as we will p p see later on) our approach builds on an assumption that, if Similarly to Rule 1, this rule prevents fragmentation of some pipeline stage is stalled, then all predecessor stages an instruction and it is a part of the basic assumptions have to be stalled as well. This means that our approach rules about pipelined execution mentioned in [20]. Utilizing Parametric Systems For Detection of Pipeline Hazards 9

We check the above rules using an SMT solver [5,21, 6 Static Detection of Potential Pipeline Hazards 3] for the bit-vector logic. To convert the rules into the bit- vector logic, we first define an operator ? that maps edges of According to Definitions 1–4, a pipeline hazard (of any of a PSG to variables of the bit-vector logic (BVL) such that the discussed kinds) occurs when two instructions access the ? ? e1 = e2 ⇔ s(e1) = s(e2) for each e1, e2 ∈ E. Intuitively, same architectural register and at least one of the accesses is edges with the same source must have the same value. Then, a write. We will further use the term spoiler whenever refer- for any e ∈ E, we define a BVL formula ψ(e) that encodes ring to the writing instruction causing the hazard. The other how the value transmitted over e is computed from values involved instruction will then be called a victim instruction. stored in registers. The formula ψ(e) is recursively defined Finally, we will speak about a hazard case when referring to as the pair formed by a spoiler and a victim instruction. In this section, we will first focus on identifying a finite  m ? ? ? V s(e) = (v, q) ∧ set of hazard cases potentially causing hazards in a given e = g(e1, ..., em) ∧ ψ(ei) ψ(e) := i=1 v ∈ Vf , processor. For that, we will use a static hazard analysis ex- true otherwise amining the PSG and pipeline stage mappings ϕ, ϕwr, ϕrd determined by the data-flow analysis from Section 5.1. In or- where g denotes the Boolean function computed by the cir- der to be able to describe a spoiler-victim pair forming a haz- cuit v ∈ Vf . ard case, we will introduce several auxiliary notions, among 0 Now, the inclusion test st(vp) ⊆ st(vp) from Rule 1 can which the so-called forward execution, minimal transfer ex- be reduced to checking validity of the following formula: ecution, and maximal store execution are the most impor- tant. 0 Φ(vp) := (ψ(vp.en) ∧ ψ(vp.rst) ∧ ψ(vp.en) ∧ We begin by introducing a notion representing a generic 0 ? ? concept of a data transfer between two vertices within a given ψ(v .rst) ∧ vp.en = 0 ∧ vp.rst = 0) ⇒ p PSG. Naturally, each such transfer must conform to the ϕwr 0 ? 0 ? (vp.en = 0 ∧ vp.rst = 0). and ϕrd mappings. We first formalize the notion of data trans- fers in a broader form in Definition 5, which is narrowed Intuitively, Φ(vp) says that if the values of vp.en, vp.rst, later on in Definition 6. In particular, Definition 5 is broader 0 0 vp.en, and vp.rst are computed according to the given flow in the sense that it may describe data transfers that can only 0 logic and vp is stalled, then vp is stalled too. Instead of be achieved when multiple instructions are involved and some checking validity of Φ(vp), one can check unsatisfiability of the instructions pass the data back to lower stages of the of the negation of the formula, i.e., ¬sat(¬Φ(vp)). More- pipeline where they are processed by instruction(s) that en- 0 over, as ¬Φ(vp) = ψ(vp.en) ∧ ψ(vp.rst) ∧ ψ(vp.en) ∧ tered the pipeline later. This would mean that a spoiler itself 0 ? ? 0 ? ψ(vp.rst) ∧ vp.en = 0 ∧ vp.rst = 0 ∧ (vp.en = (and likewise a victim) could consist of multiple instruc- 0 ? 1 ∨ vp.rst = 1), the check ¬sat(¬Φ(vp)) can be replaced tions. Dealing with such situations is, of course, relevant, but 5 by the following two simpler checks: we will restrict ourselves to the case of the spoiler and victim being single instructions each, generating the so-called for-  ?  ψ(vp.en) ∧ vp.en = 0 ∧ ward executions (Definition 6). Nevertheless, this does not  ?  mean that we ignore multi-instruction hazards completely. ¬sat  ψ(vp.rst) ∧ vp.rst = 0 ∧  (1)   Instead, thanks to the concepts introduced in Definitions 7, 0 0 ? ψ(vp.en) ∧ vp.en = 1 8, and 10, our approach is capable of generating alarms in situations where the potentially erroneous value is not made visible but ends up in a pipeline register from which an-  ?  other instruction can make its effect visible. Such alarms ψ(vp.en) ∧ vp.en = 0 ∧   may, however, be false, and our approach will not be able ¬sat  ψ(v .rst) ∧ v .rst? = 0 ∧  (2)  p p  to see that. ψ(v0 .rst) ∧ v0 .rst? = 1 p p Definition 5 Let G = (V, E, s, t) be a PSG and π an alter- nating sequence hv1, e1, . . . , ek−1, vki, k > 1, of vertices Hence, Rule 1 can be checked by applying the checks from 0 interconnected by edges, that is, v1,..., vk ∈ V , e1,..., Equations 1 and 2 to all vp, vp ∈ Vp such that ϕ(vp) = 0 ek−1 ∈ E, s(ei) = (vi, ci), and t(ei) = (vi+1, ci+1) for ϕ(vp). each 1 ≤ i < k and c1,..., ck ∈ T. Moreover, assume Rules 2–4 can be checked in a very similar way as Rule 1. that, in π, no vertex appears twice, i.e., i 6= j ⇒ vi 6= vj

5 0 for 1 ≤ i, j ≤ k. We say that a pair (π, τ) is an execu- Note that, in Equation 1, we may remove the ψ(vp.rst) conjunct 0 ? tion if there exists a valuation τ : {1,..., 2k − 1} → s.t. since the constraint vp.rst = 1 is not present, and likewise with S 0 wr ψ(vp.en) in Equation 2. vi ∈ Vr ⇒ τ(j) − 1 ∈ ϕ (vi) for all 1 < i ≤ k and 10 L. Charvat´ · A. Smrckaˇ · T. Vojnar j = 2i − 1. We denote π and τ as an execution path and Example 3. Consider the executions from Example 2. The an execution plan, respectively. We will further use X to de- execution (π1, τ1) is a forward execution while (π2, τ2) is fst lst MxProg.sel PC note the set of all executions. Moreover, we use π and π not since τ2(8 ) 6= τ2(7 ). / v v π to denote the first 1, resp., the last k of the path . Ana- For further explanation, it is important to be able to iden- τ fst τ lst logically, we will use shortcuts and in order to refer tify a register from which the transferred data can be passed τ(1) τ(2k − 1) to the valuation of the first and last element to another (later) instruction. Such an action occurs only if π τ fst = τ(1) of the execution path , respectively, i.e., and there exists a path leading from a register in a higher stage τ lst = τ(2k − 1) . to a register that belongs to a lower one. This is formalized Intuitively, given an execution, its execution plan repre- in the next definition. sents a sequence of stages in which particular vertices are Definition 7 A pipeline register v ∈ Vp is a pivot if there written during a data transfer. Hence, taking into account rd exist a stage sr ∈ ϕ (v) s.t. sr ≤ ϕ(v). the unit delay of writing, the value written to a vertex vi is obtained from a value computed in the stage τ(j) − 1, We also need to establish a notion of a stage that can j = 2i − 1 (with the first element of the path being special be cleared without the previous stage being stalled. Such and excluded from this requirement). a stage can be used to nullify the state of a partially executed Example 2. Consider the PSG G depicted in Fig. 2. A pair instruction. (π1, τ1) s.t. π1 = hX , MxMem.sel, MxMem, ExMem.d, Definition 8 A stage s ∈ S is independently clearable if ExMem, MxOp.c , MxOp, Eq.a , Eq, MxAlu.c , MxAlu, 0 0 1 0 there exist pipeline registers vp, vp ∈ Vp s.t. ϕ(vp) = s = X MxMem.sel MxMem A.d, Ai and τ1 = {1 7→ 3, 2 7→ 3, 3 7→ 0 0 ϕ(vp) + 1 and rst(vp) ∩ st(vp) 6= ∅ where st and rst are the ExMem.d ExMem MxOp.c0 MxOp 3, 4 7→ 3, 5 7→ 4, 6 7→ 4, 7 7→ mappings defined in Section 5.2. 4, 8Eq.a1 7→ 4, 9Eq 7→ 4, 10MxAlu.c0 7→ 4, 11MxAlu 7→ 4, 12A.d 7→ 4, 13A 7→ 5} is an execution in G describing one We decide whether a stage satisfies the above given con- of the possible data transfers from the register X to the reg- strains for being independently clearable in a similar way ister A. Note that we indexed the left-hand sides of the map- to Rules 1–4. More precisely, an SMT solver performs the pings by the corresponding registers to make the mappings following check in this case: more readable and to allow the reader to check that the ex-   0 0 ecution indeed obeys the rules from Definition 5 when con- ψ(vp.rst) ∧ ψ(vp.en) ∧ ψ(vp.rst) sat   (3) wr ? 0 ? 0 ? sidering the ϕ mapping from Table 1. vp.rst = 1 ∧ (vp.en = 1 ∨ vp.rst = 1) Another example of an execution is a pair (π2, τ2) where The above check can be further decomposed into two sim- π2 = hExJmp, MxJmp.sel, MxJmp, MxPC .sel, MxPC , pler checks while it suffices that at least one is satisfiable: PC .d, PC , MxProg.sel, MxProg, IdIr.d, IdIri and τ2 = {1ExJmp 7→ 4, 2MxJmp.sel 7→ 4, 3MxJmp 7→ 4, 4MxPC .sel 7→   ? 4, 5MxPC 7→ 4, 6PC .d 7→ 4, 7PC 7→ 5, 8MxProg.sel 7→ 1, ψ(vp.rst) ∧ vp.rst = 1 sat   (4) MxProg IdIr.d IdIr 0 0 ? 9 7→ 1, 10 7→ 1, 11 7→ 2}. / ψ(vp.en) ∧ vp.en = 1 To narrow our selection only to executions that are fea-   sible by a single instruction, one needs to only think of ex- ? ψ(vp.rst) ∧ vp.rst = 1 ecutions tied with execution plans where stages form a non- sat   (5) 0 0 ? decreasing sequence. Intuitively, a single instruction in the ψ(vp.rst) ∧ vp.rst = 1 pipeline can only move forward or stay in the same stage. This leads us to the definition given next. In the next step, we define an execution that can be per- formed by a single instruction and which may influence the Definition 6 A forward execution is a special type of execu- value stored in some register. tion (hv1, e1, . . . , vki, τ) ∈ X, k > 1, where the following restrictions hold for all 1 < i ≤ k and all 1 ≤ j < k: (i) vi ∈ Definition 9 A store execution is a forward execution (hv1, Vr ⇒ τ(2i−1) = τ(2i−2)+1, (ii) vi ∈ Vf ⇒ τ(2i−1) = e1, . . . , ek−1, vki, τ) for some k > 0, v1, vk ∈ Vr, so that τ(2i − 2) , and (iii) ej ∈ E ⇒ τ(2j) = τ(2j − 1). v2, . . . , vk−1 6∈ Vr. We also define a maximal store execu- tion as a store execution that is not a suffix of any other store As can be seen, the constraints of Definition 6 require that execution. the stage associated with a register should increase (Condi- tion (i)) and that the stage associated with a Boolean circuit As a final step, we define an execution that can be per- or an edge should remain the same (Conditions (ii) and (iii)). formed by a single instruction and which may influence the Clearly, if any of the conditions is not met, there could not data stored in an architectural register va ∈ Va by reading be any single instruction capable of a data transfer described some data from a (potentially different) register v ∈ Vr and by the execution. transferring them to the register va. Utilizing Parametric Systems For Detection of Pipeline Hazards 11

Definition 10 A transfer execution is a forward execution Definition 12 A WAR hazard case is a tuple (χsp , χvi ) ∈ 2 sp (hv1, e1, . . . , ek−1, vki, τ) for some k > 0, vk ∈ Vr that X consisting of a maximal store execution χsp = (hv1 , sp sp sp satisfies the following two properties: (i) The register vk sat- e1 ,..., ek−1 = vk−1.d, vk = vi, τsp ) of a spoiler in- vi isfies one of the following: (a) it is an architectural register struction and a minimal transfer execution χvi = (hv1 = v, vi vi vk ∈ Va, (b) it is a pipeline register vk ∈ Vp s.t. t(ek−1) = e1 , . . . , v` i, τvi ) of a victim instruction where v ∈ Va \ (vk, rst) and ϕ(vk) is an independently clearable stage, or {vpc}, k, ` > 1, and data in the architectural register v can (c) the register vk ∈ Vp is a pivot s.t. t(ek−1) = (vk, d). be written by the spoiler before they are read by the victim, lst fst (ii) Moreover, t(ei) 6∈ Vp × {en, rst} for all 1 ≤ i < k. i.e., τsp < τvi . We also define a minimal transfer execution as a transfer execution that does not contain any prefix that is a transfer Definition 13 A WAW hazard case is a tuple (χsp , χvi ) ∈ 2 sp execution. X consisting of a maximal store execution χsp = (hv1 , sp sp sp e1 ,..., ek−1 = v.d, vk = vi, τsp ) of a spoiler instruction Condition (i-a) is straightforward as the execution af- vi vi vi and a maximal store execution χvi = (hv1 , e1 , . . . , e`−1 = fects the architectural register directly in this case. Clear- vi v.d, v` = vi, τvi ) of a victim instruction where v ∈ Va \ ing the target pipeline register vk ∈ Vp in an independently {vpc}, k, ` > 1, and data into the architectural register v can clearable stage as described in Condition (i-b) causes can- be written from two different stages. In the following, with- cellation of any partially executed instruction in Stage ϕ(vk). out a loss of generality (since the conflicting instructions can Such an event may indirectly influence any architectural reg- always be swapped), we will assume the spoiler to perform ister va ∈ Va that belongs to a stage s ≥ ϕ(vk). Simi- lst lst a write operation in an earlier stage, i.e., τsp < τ . larly, concerning Condition (i-c), if the target pipeline reg- vi ister vk ∈ Vp is a pivot, the value read from it—by a later One can observe that there is no need to include any min- instruction—may also indirectly influence any architectural imal transfer execution in the case of WAW hazard since an register that the later instruction writes to. Next, as described error that is caused by the hazard is manifested instantly by by Condition (ii), the transfer execution must not traverse writing an incorrect value to the register v. through enable connections of pipeline registers. Such exe- cutions cannot influence the value of any architectural regis- Definition 14 A CTL hazard case is a tuple (χsp , χvi ) ∈ 2 sp ter. Their only impact can be that they stall a stage. This also X consisting of a maximal store execution χsp = (hv1 , sp sp sp holds for reset connections of pipeline registers in a stage e1 ,..., ek−1 = vk−1.d, vk = vpci, τsp ) of a spoiler in- vi that is not independently clearable—in this case, an instruc- struction and a minimal transfer execution χvi = (hv1 = vi vi tion cannot be lost since the previous stage is always stalled. vpc, e1 , . . . , v` i, τvi ) of a victim instruction where k, ` > vi In such a case, the pipeline consistency given by Rules 1–4 1, vpc 6= v` . Moreover, the vpc register must be written from Section 5.2 assures correct preservation of all partially with data originating from a source other than the program executed instructions. counter’s auto-increment logic, which we consider to appear An incorrectly handled pipeline hazard manifests upon in Stage 1. Therefore, the spoiler must always write from lst the first write of improper data into some architectural reg- a stage other than the first one, i.e., τsp > 2. ister of the design. Therefore, it suffices to further deal with the minimal transfer executions only. We can now formalize Note that, since the definition of a particular hazard case the notion of hazard cases in a unified way for all the differ- speaks about registers, their access stages, and the path along ent kinds of hazards (restricted to the case when the spoiler which the problematic data are transferred, it is not defined and victim consist of single instructions) as follows. In par- for a single concrete instruction only but for an entire class 2 of instructions that conform to the criteria given by the haz- ticular, we represent a hazard case as a tuple (χsp , χvi ) ∈ X lst fst where χsp and χvi are spoiler and victim executions appro- ard case. Further, note that the cases when τsp = τvi for priate for the concerned kind of hazard. A more rigorous RAW, WAR, and CTL hazards as well as the cases when lst lst description of each considered type of hazard cases is given τsp = τvi for WAW hazards are not covered by the above in the following definitions. definitions. This is because our approach assumes correct execution of isolated instructions, which rules such cases Definition 11 A RAW hazard case is a tuple (χsp , χvi ) ∈ out. Such correctness can be checked separately using, e.g., 2 sp X consisting of a maximal store execution χsp = (hv1 , methods described in [7,9]. Finally, albeit CTL hazards are sp sp sp e1 ,..., ek−1 = vk−1.d, vk = vi, τsp ) of a spoiler in- quite similar to RAW hazards (as CTL hazards could be con- vi struction and a minimal transfer execution χvi = (hv1 = v, sidered a special kind of RAW hazards on the vpc register), vi vi e1 , . . . , v` i, τvi ) of a victim instruction where v ∈ Va \ we prefer to keep them separate due to CTL hazards do still {vpc}, k, ` > 1, and data in the architectural register v can come with some special constraints (such as the treatment of be read by the victim instruction before they are written by the program counter’s auto-increment logic) and due to the fst lst the spoiler, i.e., τvi < τsp . different nature of errors that may typically arise from them. 12 L. Charvat´ · A. Smrckaˇ · T. Vojnar

In order to generate the set H of hazard cases, we pro- Algorithm 1 Procedure computing a set of hazard cases H. ceed as follows. First, using results of the data-flow analysis Require: A PSG G = (V, E, s, t, ω), a set Va ⊆ V of architectural from Section 5.1, we find all registers va ∈ Va for which registers, a program counter vpc ∈ Va, a set Vp ⊆ V of pipeline there is a risk that some hazard situation may be initiated registers, Va ∩ Vp = ∅, a set Vpivot ⊆ Vp of pivots, and a set Sic ⊆ S of independently clearable stages. between stages s1, s2 ∈ . The conditions that must hold S Ensure: A set H ⊆ X × X of hazard cases in the CPU encoded by G. for s1, s2 differ for different hazard cases. For instance, for 1: Vt := Va ∪ Vpivot ∪ {v ∈ Vp | ϕ(v) ∈ Sic } RAW hazards, we need the following conditions to hold: 2: Let A denote Va × N × N × Vt × N wr s − 1 ∈ ϕwr(v ) s + 1 ∈ ϕrd(v ) s < s 3: ARAW := {(va, s1, s2, vt, s3) ∈ A | s1 − 1 ∈ ϕ (va) ∧ 1 a , 2 a , and 2 1. The rd wr s2 + 1 ∈ ϕ (va) ∧ s2 < s1 ∧ s3 − 1 ∈ ϕ (vt) ∧ s2 ≤ s3} condition s2 < s1 reflects the fact that the needed data are wr 4: AWAR := {(va, s1, s2, vt, s3) ∈ A | s1 − 1 ∈ ϕ (va) ∧ rd wr read from va before they are written into va. The rest of the s2 + 1 ∈ ϕ (va) ∧ s1 < s2 ∧ s3 − 1 ∈ ϕ (vt) ∧ s2 ≤ s3} wr condition reflects that it must be possible to write to va in 5: ACTL := {(vpc , s1, 1, vt, s3) ∈ A | s1 − 1 ∈ ϕ (va) ∧ 2 ∈ ϕrd(v )∧s > 2∧s −1 ∈ ϕwr(v )∧v 6= v ∧s > 1} stage s1 and read in stage s2, i.e., it must have a predecessor pc 1 3 t pc t 3 6: A := ARAW ∪ AWAR ∪ ACTL register in stage s1−1 and a successor register in stage s2+1. 7: H := ∅ The subtraction/addition of 1 is applied due to the unit write 8: for (va, s1, s2, vt, s3) ∈ A do delay that happens between the data are read from the previ- 9: χsp := { (π, τ) ∈ X | π = hv1, e1, . . . , ek−1, vki ∧ v ({vk} × N × N × Vt × N) ∩ A 6= ∅ ∧ t(ek−1) = ous register and written to a and then between reading the lst (vk, d) ∧ v2, . . . , vk−1 6∈ (Va ∪ Vp) ∧ τ = s1 ∧ data from va and writing them to the successor register. For (π, τ) is a maximal store execution } other kinds of hazards, the conditions are derived from the 10: χvi := { (π, τ) ∈ X | π = hv1, e1, . . . , ek−1, vki ∧ fst lst kind of hazard analogously as for RAW hazards as shown ({v1} × N × N × {vk} × N) ∩ A 6= ∅ ∧ τ = s2 ∧ τ = later on. Second, we find all maximal store executions that s3 ∧ (π, τ) is a minimal transfer execution } 11: H := H ∪ (χsp × χvi ) terminate in the register va. Finally, we generate all mini- 12: end for 0 mal transfer executions originating from the va vertex of the 13: Let A denote Va × N × N 6 0 wr given PSG G. 14: AWAW := {(va, s1, s2) ∈ A | s1 − 1, s2 − 1 ∈ ϕ (va) ∧ s < s } The procedure for generating the set H is shown in Alg. 1. 1 2 15: for (va, s1, s2) ∈ AWAW do The procedure first constructs auxiliary sets ARAW , AWAR, 16: χsp := { (π, τ) ∈ X | π = hv1, e1, . . . , ek−1, vki ∧ and ACTL strictly following the constraints given by RAW, ({vk} × N × N) ∩ AWAW 6= ∅ ∧ t(ek−1) = lst WAR, and CTL hazard cases (see Definitions 11, 12, and 14). (vk, d) ∧ v2, . . . , vk−1 6∈ (Va ∪ Vp) ∧ τ = s1 ∧ (π, τ) is a maximal store execution } The sets ARAW , AWAR, and ACTL consist of quintuples 17: χ := { (π, τ) ∈ | π = hv , e , . . . , e , v i ∧ characterising suspected hazards. They include the architec- vi X 1 1 k−1 k ({vk} × N × N) ∩ AWAW 6= ∅ ∧ t(ek−1) = lst tural register va on which the hazard happens, the target (vk, d) ∧ v2, . . . , vk−1 6∈ (Va ∪ Vp) ∧ τ = s2 ∧ register vt through which the hazard manifests, and three (π, τ) is a maximal store execution } 18: := ∪ (χ × χ ) stages: namely, stages s1 and s2 in which the conflicting H H sp vi 19: end for read/write operations on va happen, and stage s3 in which 20: return H the hazard gets manifested. For WAW hazards, the proce- dure later on proceeds similarly, but there is no vt and s3 needed since the hazard manifests immediately upon the by Stage 5 (ϕrd(X) = {3, 4, 5}). By Definition 12, to form second write operation (Definition 13). The auxiliary sets a WAR hazard, the PSG must contain (i) a maximal store ex- are then used for finding maximal store and minimal trans- ecution of a spoiler instruction (π , τ ) ∈ ending in X fer executions in the PSG. A standard breadth-first search al- sp sp X and (ii) a minimal transfer execution (π , τ ) ∈ leading gorithm during which constraints from Definitions 5–10 are vi vi X from X to some target register. There are multiple execu- checked on-the-fly can be used to obtain the minimal trans- tions of spoiler and victim instructions that satisfy the above fer executions in G for the suspected hazards. Similarly, the criteria. Each of them must be considered in order to verify procedure may deploy the depth-first search algorithm while that the design is free of WAR hazards. For instance, one checking constraints from Definitions 5, 6, and 9 in order to may consider a spoiler execution (πsp , τsp ) with πsp = hX, find the maximal store executions. X Inc.a0, Inc, MxInc.c0, MxInc, X.d, Xi and τsp = {1 7→ Example 4. Consider the PSG from Fig. 2 and the mappings 2, 2Inc.a0 7→ 2, 3Inc 7→ 2, 4MxInc.c0 7→ 2, 4MxInc 7→ 2, shown in Table 1. One can see that there is a potential WAR 5X.d 7→ 2, 6X 7→ 3}. Further, we can consider a victim exe- hazard on the index register X ∈ Va because, for example, cution (πvi , τvi ) with the target memory cell Memj written wr it can be written in Stage 3 (ϕ (X) = {2, 3, 4}) and read in Stage 5 where πvi = hX, Cmpj.a0, Cmpj, MxSel j.c1, MxSel j, Memj.en, Memji. An instance of an execution 6 For WAW hazards, in the final step, we generate maximal store plan τ for the path π is {1X 7→ 4, 2Cmpj .a0 7→ 4, executions instead. In this case, the error caused by the hazard is im- vi vi Cmp MxSel .c MxSel Mem .en mediately visible from the programmer’s point of view, and there is no 3 j 7→ 4, 4 j 1 7→ 4, 5 j 7→ 4, 6 j 7→ need of its propagation to another architectural register. 4, 7Memj 7→ 5}. The given pair of a spoiler and victim Utilizing Parametric Systems For Detection of Pipeline Hazards 13 is clearly a candidate for a WAR hazard since the needed means of the pipeline flow logic, e.g., an earlier instruction data are overwritten before they are read (unless some con- may force a later instruction to be stalled or cleared. Fi- trol logic over the involved executions prevents the hazard, nally, the instructions end up in a state denoting that they which will be the subject of further checking). / left the pipeline. In the following explanation, we start by constructing Example 5. Further, as an example of a control hazard, the set of states of the system P . Then, we proceed to cap- one can consider a spoiler execution (π , τ ) with π = sp sp sp turing the above mentioned influence of the pipeline flow hExAlu, MxAlu.sel, MxAlu, MxPC .c , MxPC , PC .d, 1 logic and reflect it in the transition relation of the system P . PC i and τ = {1ExAlu 7→ 4, 2MxAlu.sel 7→ 4, 3MxAlu 7→ sp Finally, we define the set of minimal bad configurations of 4, 4MxPC .c1 7→ 4, 5MxPC 7→ 4, 6PC .d 7→ 4, 7PC 7→ 5}. the system P that describes the prohibited interleavings of As an instance of a victim execution (π , τ ), we can con- vi vi instructions causing the hazard. sider an execution path πvi = hPC , MxProg.sel, MxProg, PC IdIr.d, IdIri with an execution plan τvi = {1 7→ 1, 2MxProg.sel 7→ 1, 3MxProg 7→ 1, 4IdIr.d 7→ 1, 5IdIr 7→ 2}. 7.1 States and Edge Conditions of the Parametric System Note that, in this case, IdIr 6∈ Va, but we know from Ta- 2 ble 1 that the pipeline register IdIr is a pivot, and so it is Given a hazard case of the form (χsp , χvi ) ∈ X , χsp = still a valid terminating element for a transfer execution. / (πsp , τsp ), χvi = (πvi , τvi ), the parametric system P will model interactions among four classes of processes K := {sp “spoiler”, vi “victim”, sf “stall-flow”, nf “normal-flow”} 7 Parametric Systems for Potential Hazards (where each process represents an executing instruction of some class). This follows the fact that each type of the con- We will now describe how the potentially hazardous be- sidered pipeline hazard is caused by some pair of instruc- haviour of a spoiler and a victim instruction described by tions. The sp class represents the spoiler part of the hazard a hazard case can be modelled and checked for feasibility case, i.e., an instruction that writes to a register v ∈ Va in using a parametric system P . Namely, the parametric sys- a stage τsp (v). The vi class then represents an instruction tem P will be constructed such that if the behaviour is found corresponding to the victim part of the hazard case, reading infeasible by analysing P , the hazard case does not describe or writing from/to v in a stage τvi (v). Further, the sf and nf a real hazard (the suspected hazard gets prevented by the classes both denote any other instructions than the spoiler pipeline flow logic). Intuitively, in the system P , we map and victim—we just differentiate two operating modes of n ≥ 2 instructions in the pipeline to n concurrently running these instructions. As we will discuss later in Section 7.2, processes in a linear array (with the earliest instruction on the difference between the stall- and normal-flow operation the left). modes is that an sf -class instruction in a stage s0 ∈ S causes Note that the notion of parametric systems does not limit that all pipeline stages s ∈ S s.t. s < s0 get stalled. Both the the value of n from above, and verification methods de- sf and nf classes serve as a pipeline filler and a sink for signed for them must cope with that. Exploiting this fea- cleared (flushed) instructions. ture of parametric verification is not necessary in our case To facilitate the construction of a parametric system al- since the relevant values of n are limited by the length of lowing us to verify whether a given hazard case corresponds the pipeline. However, the use of parametric systems is still to a real hazard or not, we need to introduce an extended beneficial since while there is a single spoiler and victim and set of stages. Let S¯ := S ∪ {⊥, >} be the set of stages ex- the number of “padding” instructions in between of them tended with auxiliary initial “⊥” and final “>” stages. Intu- is bounded, we still do not know how many “padding” in- itively, the initial stage “⊥” will represent instructions that structions should be considered for the hazard to manifest. have not entered the pipeline yet. Likewise, the final stage Hence, instead of exploring all possible values of n, we use “>” will denote finished instructions that have already left parametric verification to perform the verification for any the pipeline. value of n, building on that the efficiency of the parametric We will then represent the behaviour of instructions given verification approach we use is quite sufficient for us. More- by a hazard case h = (χsp , χvi ) in the form of a labelled over, should our approach be extended to handle in a more parametric system, called a hazard system (HS), P h = (Q h , precise way even multi-instruction hazards, the value of n ∆h , αh ) where Q h := K×S¯, ∆h will be introduced in Sec- could become unbounded, and the power of parametric ver- tion 7.2, and αh : Q h → 2E is a state labelling function. The ification could become truly needed. labelling function αh associates each state with a set of edge In the parametric systems we construct, all instructions conditions that should hold in this state for the hazard to be are initially in a state saying that their execution has not executable. We will show the construction of the labelling started. Then, they proceed through individual stages of the below. Note that each state q ∈ Q h represents a unique in- pipeline during which they may interact with each other by struction class and a stage in which an instruction of this 14 L. Charvat´ · A. Smrckaˇ · T. Vojnar class is supposed to be. Finally, for a proper understanding either writing of the data into the register is enabled or the of the rest of the section, we once again stress that particu- register is cleared: lar states in Q h are states of individual instructions, not of h h αen(χ) := {(vk.en 1, τ(2k − 1)) | the entire system. A configuration of the system P is a se- (7) quence of such states. χ = (hv1, e1, . . . , ek−1 = vk.d, vki, τ)}, Next, we define the mapping αh describing which edge h αrst(χ) := {(vk.rst 1, τ(2k − 1) | conditions must hold in a state q = hκ, si ∈ Q h , which (8) χ = (hv1, e1, . . . , ek−1 = vk.rst, vki, τ)}. is a state of an instruction of the class κ ∈ K in the stage ¯ s ∈ S, for that instruction to execute in accordance with the h In particular, αen ensures that the data transferred along the hazard case h. First, for instructions of the classes κ = sf path described by the execution χ are indeed written to its and κ = nf , we define αh (hκ, si) := ∅ for every s ∈ ¯ S destination register vk at the end of the execution. Therefore, since we do not expect any special behaviour from instruc- h αen produces a singleton containing a pair consisting of the tions of these classes, and, on every realistic processor, we condition vk.en 1 and the stage τ(2k − 1), which is can always find instructions that do not interfere with the the stage where the data reside just prior to the write. Sim- spoiler and victim instructions and may serve as the needed h ilarly, αrst produces a singleton containing a pair consist- pipeline filler. Likewise, we define αh (hκ, si) := ∅ for any ing from the condition vk.rst 1 and the stage τ(2k − κ ∈ K and s ∈ {⊥, >}, i.e., for instructions that have not 1) so that the target register is indeed cleared. Using the yet started or that have already ended. above mappings, we can define αh for the given hazard case For the spoiler and victim instructions, the idea is to ex- h = (χsp , χvi ) such that the following holds for any state tract the edge conditions by looking for the necessary set- 7 hκ, si ∈ {sp, vi} × S¯: tings of selector, enable, and clear edges so that the data h h h involved in the potential hazard are carried over the paths α (hκ, si) := {c ∈ E | (c, s) ∈ αsel(χκ) ∪ αen(χκ) ∪ (9) πκ for κ ∈ {sp, vi} that are a part of the concerned spoiler h h αrst(χκ)}. and victim executions χκ = (πκ, τκ). The mapping α can h h be constructed from three auxiliary mappings αsel, αen, and Example 6. Assume the hazard case (χsp , χvi ) shown in h E×S h αrst : X → 2 where αsel will be examining all edges Example 4 for the microprocessor from Example 1. First, we but the last one (hence covering all edges that route the data focus on the spoiler execution χsp = (πsp , τsp ). Since the through multiplexers) and the last edge will be covered by microprocessor contains five pipeline stages, the spoiler gets h ¯ ¯ exactly one of the two remaining mappings (related to en- associated with the set of states Qsp := {sp}×S where S = abling a write of the data to the target register or clearing the {⊥, 1,..., 5, >}. We will now show how the αh mapping is h h register). computed for the states of Qsp . From the definition of α , it h In order to define αsel, we will use an auxiliary map- directly follows that ping σ : Vmx × E → E. Given a multiplexer vmx ∈ Vmx, h h the mapping σ captures the edge condition that must hold α (hsp, ⊥i) = α (hsp, >i) = ∅. over the multiplexer’s selector edge vmx.sel for the data on For the states hsp, 1i, ..., hsp, 5i, one has to first compute the the i-th inbound-case edge vmx.ci to be propagated to the h h h auxiliary mappings αsel, αen, and αrst from Equation 9. As multiplexer’s outbound edge vmx.q. Hence, the X register is written via its d connection, it immediately follows that σ(vmx, vmx.ci) := vmx.sel binω(vmx.sel)(i) h αrst(χsp ) = ∅. n where binn : Z → B is the standard two’s complement π χ encoding of a decimal value on n bits. Next, since the path sp of the spoiler store execution sp passes through a single multiplexer, namely, MxInc, via the Now, the αh mapping is defined as sel MxInc.c0 edge MxInc.c0 with τsp (4 ) = 2, we get h αsel(χ) := {(σ(vi, ei−1), τ(2i − 1)) | 1 < i < k ∧ h (6) αsel(χsp ) = {(MxInc.sel 0, 2)}. vi ∈ Vmx ∧ χ = (hv1, e1, . . . , ei−1, vi, . . . , vki, τ)}. h For αen, we only need to assure that the register X is written Intuitively, the αh mapping produces a set of pairs consist- X .d sel at the end of the execution. Since τsp (5 ) = 2, we let ing of a condition σ(vi, ei−1) ∈ E over selector edges that h is required by the multiplexer vi ∈ Vmx to propagate the αen(χsp ) = {(X .en 1, 2)}. data along the execution path π and the stage τ(v ) in which i 7 h Note that the executions can also end by an vk.en edge. However, the particular condition must be satisfied. Similarly, the αen h in this case, no matter what the value of the enable signal is a hazard and αrst mappings establish the necessary condition for the happens by enabling/not enabling a write of some data into an archi- final edge of the execution’s target register, making sure that tectural register. Hence, no further condition is needed in this case. Utilizing Parametric Systems For Detection of Pipeline Hazards 15

Finally, by uniting the above computed auxiliary mappings, configurations of the TS T h = (C ,,→) induced by the PSG Qh we get that and (ii) a predicate csat h ⊆ 2E × 2 .

h The purpose of the unwind h mapping is to compute all α (hsp, 2i) = {MxInc.sel 0, X .en 1} configurations of the TS T h in which T h (and hence the pro- cessor it represents) can be when the processor contains an and ∀i ∈ S \{2}: αh (hsp, ii) = ∅. Analogically, for the instruction of a class κ in a stage s while executing within victim execution χ = (π , τ ) of the analyzed hazard vi vi vi the given hazard case h. The considered configurations must case, we would infer that be such that the processor can reach them by going through h all preceding stages and such that the processor can finish αsel(χvi ) = {(MxSel j.sel 1, 4)} the execution of the instruction by going through all its fur- h h and αrst(χvi ) = αen(χvi ) = ∅. Therefore, we get that ther stages, all the time executing within the hazard case h. In particular, let m = max( ) be the number of stages and h S α (hvi, 4i) = {MxSel j.sel 1} let hκ, si ∈ Q h be an instruction state representing an in- struction of a class κ in a stage s within a hazard case h. and ∀i ∈ S \{4}: αh (hvi, ii) = ∅. / Then, unwind h (hκ, si) consists of exactly all those config- urations c0 ∈ C such that there is a sequence of consecutive states hc , ..., c , ..., c i in T h that conforms to the 7.2 The Transition Relation of the Parametric System −s+1 0 m−s following rules: For the construction of the transition relation ∆h presented ∀ − s < i < m − s: c ,→ c , (10) later on, we will first introduce three predicates that charac- i i+1 terise mutual interactions of pairs of instructions whose ex- h ecution has reached some states q1, q2 ∈ Q of the verified ∀ − s < i ≤ m − s: c ∈ γ(αh (hκ, s + ii)). (11) h i HS P . We stress that q1 and q2 are states of the execution of two considered instructions, which are of course a part of Note that the negative/positive index in the above equations h a single configuration of the HS P . Before providing rig- simply means moving forward to/away from the pipeline orous definitions of the predicates, which are given later in start when considering the origin given by the index of 0, this section, we first provide some intuition behind them. respectively. The first constraint above ensures that we in- h A pair of states q1, q2 ∈ Q and a stage s ∈ S satisfy the deed consider a trace in the TS T h . The second condition st ternary stage stall predicate ←→ ⊆ Q h × S × Q h provided then ensures that the trace passes all stages of an instruction h that the edge conditions associated with the states q1 and q2 of the given class while the processor is executing within the ensure that the stage s is stalled, and thus the contents of all given hazard case. pipeline registers of s stays unchanged. We will further use The above described computation of the unwind h map- st st ping can be implemented symbolically using a BVL for- the shorthand q1 ←→ q2 for (q1, s, q2) ∈ ←→. h,s h mula unwind ? (q) for any q ∈ Q h . To describe the com- h h Further, a pair of states q1, q2 ∈ Q and a stage s ∈ S putation, we introduce the notation ,→? to denote the cl (i,i+1) satisfy the ternary stage clear predicate ←→ ⊆ Q h × S × Q h result of a (straightforward) conversion of the relation ,→ to h provided that the stage s is cleared, i.e., the contents of all a BVL formula where all variables representing the current pipeline registers of s is nullified. We will further use the state of the TS T h are indexed with i and those represent- cl cl ing the future state are indexed with i + 1. Moreover, as in shorthand q1 ←→ q2 for (q1, s, q2) ∈ ←→. h,s h ? Section 5.2, we use ei to denote the conversion of an edge h Finally, a pair of states q1, q2 ∈ Q satisfies a binary e ∈ E indexed with the trace index i to a BVL variable. cf state conflict predicate ←→ ⊆ Q h × Q h provided that the Then, given q = hκ, si ∈ Q h \ K × {⊥, >}, the BVL for- h ? given processor excludes a configuration where two instruc- mula unwind h (q) is obtained as follows: tions would appear in the states q , q at the same time. We 1 2 m−s−1 cf cf V ? will further use the shorthand q1 ←→ q2 for (q1, q2) ∈ ←→. F1 := ,→(i,i+1), h h i=−s+1 For instance, one of the typical scenarios when two states m−s V V ? q , q ∈ Q h are in a state conflict occurs when there exists F2(q) := ei = b, 1 2 h h h i=−s+1 e b∈α (hκ,s+ii) an edge e ∈ E so that e b1 ∈ α (q1)∧e b2 ∈ α (q2), V ? ? F3 := e = e , b1, b2 ∈ , while b1 6= b2. 0 B e∈E In order to formally define the above described predi- unwind ? (q) := ∃E : F ∧ F (q) ∧ F . cates, we first introduce two auxiliary notions: in particular, h 1 2 3 h C (12) (i) a mapping unwind h : Q → 2 where C is the set of 16 L. Charvat´ · A. Smrckaˇ · T. Vojnar

h Above, the existential quantification ranges over the set E = Definition 16 For any instruction states q1, q2 ∈ Q and {e? | e ∈ E ∧ −s < i ≤ m − s}. Its reason is to get rid cl i any stage s ∈ S, the stage clear predicate q1 ←→ q2 is of the concrete past and future values of the variables that h,s defined as follows: appear in the execution, keeping only their impact on the 8 cl current values of the variables. Finally, in order to extend q1 ←→ q2 ⇐⇒ ∃ vp ∈ Vp : ϕ(vp) = s ∧ 0 h,s the definition of unwind h for initial and final states q ∈ (16) ? 0 ¬csat ({v . 0}, {q , q }). K × {⊥, >}, we define unwind h (q ) := true. h p rst 1 2 Further, we proceed to the second auxiliary predicate: Note that the definition requires that the representative csat . The csat predicate determines satisfiability of a set h h register must be cleared (since the formula cannot be satis- of edge conditions I ⊆ in a situation when the pipeline E fied with the v .rst edge being zero). The consistency rules contains instructions in states from a set S ⊆ Q h . Formally, p then assure that the same holds for all registers of the given it is defined as follows: stage. \ \ cf csat (I,S) ⇔ γ(c) ∩ unwind (q) 6= ∅. (13) To define the ←→ predicate, we only need to be able to h h h c∈I q∈S determine whether two given instruction states are prohib- ited from occurring together in a single pipeline configura- The evaluation of csat (I,S) can be naturally reduced to h tion by the control logic of the considered processor. This is, checking the satisfiability of a BVL formula as follows: however, easy thanks to the csat h predicate as shown below.   ^ ? ^ ? q q ∈ Q h csat h (I,S) ⇔ sat e = b ∧ unwind h (q) . Definition 17 For any instruction states 1, 2 , the cf e b∈I q∈S state conflict predicate q1 ←→ q2 is defined as follows: (14) h cf q1 ←→ q2 ⇐⇒ ¬csat h (∅, {q1, q2}). (17) Now, the predicate csat h can be used to precisely define h cf the needed predicates ←→st , ←→cl , and ←→ as follows. h h h Intuitively, the expression csat h (∅, {q1, q2}) does not put any constraints on edge conditions, but it still checks h Definition 15 For any instruction states q1, q2 ∈ Q and whether some concurrently executing instructions can si- st any stage s ∈ S, the stage stall predicate q1 ←→ q2 is de- multaneously get into states q1 and q2. Hence, its negation h,s fined as follows: says that this is excluded in the given processor, allowing us cf to define the ←→ predicate. st h q1 ←→ q2 ⇐⇒ ∃ vp ∈ Vp : ϕ(vp) = s ∧ h,s Example 7. In this example, we will demonstrate how the (15) st ¬csat h ({vp.en 1}, {q1, q2}) ∧ predicate ←→ can be evaluated for a given pair of states and h ¬csat h ({vp.rst 1}, {q1, q2}). a given stage. Let us consider states hsp, 2i, hvi, 3i, Stage 2, and the hazard case h = (χsp , χvi ) from Example 4. Here, Intuitively, the definition requires that the presence of the spoiler instruction in state hsp, 2i writes into the reg- some instructions in states q1 and q2 in the pipeline ensures ister X the (auto-incremented) value previously read from that there is a pipeline register vp in stage s, which we denote the same register. The victim instruction in state hvi, 4i then as a representative register below, such that the value of vp reads the value j from the register X and uses it as an index can neither be updated nor cleared, i.e., vp keeps its value. to access the memory cell Memj. Note that the already established validity of the consistency From Definition 15, we know that, in order to determine Rules 1 and 4 implies that the setting of any control edge the value of hsp, 2i ←→st hvi, 3i, one has to (i) pick a rep- (en, rst) is the same for all pipeline registers across the h,2 resentative pipeline register v ∈ {v ∈ V | ϕ(v) = 2}, given pipeline stage, and so the fact that some representative p p (ii) evaluate Φ := ¬csat({v .en 1}, {hsp, 2i, hvi, 3i}), register is stalled means that all registers of the given stage 1 p and (iii) evaluate Φ := ¬csat({v .rst 1}, {hsp, 2i, are stalled (and the instruction that is now in stage s stays in 2 p hvi, 3i}). The representative register v for Stage 2 can be it). p picked randomly—indeed, as we have already said, the con- In a similar fashion, we define the ←→cl predicate. h sistency Rules 1 and 4 (from Section 5.2) imply that the setting of any control edges (en, rst) is the same for all 8 In our implementation of the approach, we replace the existential pipeline registers in a particular pipeline stage. quantification by simply pruning away all variables unrelated with any As for Step (i) above, it suffices to look in Table 1 and e? for any e ∈ E and by renaming the remaining variables in a unique way such that no conflicts arise when constructing more complex for- choose, for instance, IdIr as the representative register. More- ? mulae on top unwind h (q). over, in Example 1, we have pointed out that the value of the Utilizing Parametric Systems For Detection of Pipeline Hazards 17 enable edge on the IdIr register is determined by the fol- set and the reset edge rst is unset. Moreover, if rst is set, lowing expression in BVL: then the data-out q is nullified. Otherwise, when both en and

? ? ? rst are unset, the data-out edge q keeps the value from the IdIr.en = ¬IncX .q ∨ ¬OfWrMem.q . (18) previous cycle. Further, in Example 6, we have seen that Now, to address Step (ii), we know that, according to α(hvi, 4i) = {MxSel j.sel 1}, Equation 14, Φ1 expands to

¬sat(unwind ? (hsp, 2i) ∧ unwind ? (hvi, 3i) ∧ which imples that the formula F2(hvi, 3i) must ensure h h (19) ? ? IdIr.en = 1). MxSel.sel1 = 1. (24) ? We further concetrate on the expansion of unwind h (hsp, 2i). By combining the observations from Formulae 23 and 24, According to Equation 12, we need to construct formulae and by adding Formula F3 and the existential quantification F1, F2(hsp, 2i), and F3. First, the transition relation described of Equation 12, we obtain the following statement: 9 by Formula F1 contains the following conjuncts : ? ?  ? ? ? (ExWrMem.en = 1) ⇒ (OfWrMem.q = 1) ∧ Impl.q0 = (IncX .q0 ⇒ ExWrX .q0) ∧ (20) ? ? ? ExWrMem.rst = 0. MxInc.sel0 = Impl.q0. (25) To see that the above holds, it suffices to check how the value ? of MxInc.sel is computed from its predecesors in the PSG Here, the ExWrMem.rst = 0 conjuct comes from the shown in Fig. 2. The formula F2(hsp, 2i) then gives fact that the data-out edge must not be zero because of the constraint in Formula 24. ? ? MxInc.sel0 = 0 ∧ X .en0 = 1, (21) Next, according to the consistency Rule 3 from Sec- tion 5.2, which holds globally at any pipeline cycle, any which is a direct consequence of the result that we have ob- pipeline stage that directly precedes the currently stalled one tained in Example 6 where we have shown must also be stalled. By induction, the rule can be general- α(hsp, 2i) = {MxInc.sel 0,X.en 1}. ized to any preceding stage. Therefore, the following expres- sion must hold: Finally, Formula F3 simply asserts equality between zero- indexed and non-indexed variables. We can then apply the (ExWrMem.en? = 0 ∧ ExWrMem.rst? = 0) ⇒ existential quantification from Equation 12, which allows us (26) (IdIr. ? = 0 ∧ IdIr. ? = 0). to get rid of the indexed variables, leading to that the below en rst equality must hold: In particular, the above comes from the fact that ϕ(IdIr) < IncX .q? = 1. (22) ϕ(ExWrMem). By applying the modus tollens rule on Formula 26, we Now, we will apply a similar approach to expand the for- ? get mula unwind h (hvi, 3i). In this case, the following conjuncts of Formula F turn out to be relevant: 1 (IdIr.en? = 1 ∨ IdIr.rst? = 1) ⇒ (27) ? ? ? ? ExWrMem.d0 = OfWrMem.q0 ∧ (ExWrMem.en = 1 ∨ ExWrMem.rst = 1). ExWrMem.q? = f next (ExWrMem.q?, 1 ExWrMem 0 Finally, if we put together our observations made in Formu- ? ? ExWrMem.d0, ExWrMem.en0, (23) lae 18, 22, 25, and 27, we can conclude that the expression ? ExWrMem.rst0) ∧ unwind ? (hsp, 2i) ∧ unwind ? (hvi, 3i) ∧ IdIr.en? = 1 ? ? h h MxSel j.sel1 = ExWrMem.q1.

is not satisfiable. Thus, the expression Φ1 evaluates to true. Above, f next is the next-state function that was de- ExWrMem Analogically, for Step (iii), we would also derive that fined in Section 3.2 and that propages the value on the data- st Φ2 is true, and therefore the predicate hsp, 2i ←→ hvi, 3i in edge d to the data-out edge q iff the enable edge en is h,2 necessarily holds. In other words, this means that the NOP 9 The entire formula is, of course, much bigger—indeed, it describes injection into Stage 3 takes place whenever there is a spoiler the entire transition relation. When the satisfiability checking is done automatically, the solver will consider the entire formula. However, we defined by χsp in Stage 2 and a victim described by χvi in select its relevant parts only so that the example is readable. Stage 3. / 18 L. Charvat´ · A. Smrckaˇ · T. Vojnar

We can now define transitions that the transition relation transition q → hκ0, s+1i, κ0 ∈ {nf , sf }, into ∆h . Formally, h h ∆ of the HS P contains. First, for every instruction state ∀q = hκ, si ∈ Qch , ∀q0 ∈ Q h , ∀κ0 ∈ {nf , sf }: q = hκ, si ∈ Q h , ∆h contains a transition q → q allowing 0 0 h cf 0 the instruction that is in q to stay in q whenever the state q (∃← : {q } |= q → hκ , s + 1i) ∈ ∆ ⇔ hκ, si ←→ q . h appears in a configuration of the pipeline of the given pro- (31) cessor (i.e., a configuration of the transition system induced h by P ) that contains a combination of instruction states q1, As for the possibility of new instructions entering the h q2 ∈ Q which causes the instruction in the state q to be pipeline, only the left-most instruction in a given configura- h stalled. Formally, ∀q = hκ, si, q1, q2 ∈ Q : tion that has so far not entered the pipeline is allowed to en- ter it. Moreover, new instructions cannot enter the first stage h st (∃↔ : {q1, q2} |= q → q) ∈ ∆ ⇔ q1 ←→ q2. (28) if it is stalled. More precisely, ∀q = hκ, ⊥i, q0 = hκ0, ⊥i, h,s h q1, q2 ∈ Q : As we have already mentioned at the beginning of Sec- 0 h tion 7, we use the stall-flow sf and normal-flow nf instruc- (∃← : {q } |= q → q) ∈ ∆ , (32) tion classes to model pipeline-filler instructions, i.e., to model h st all other instructions than the spoiler and victim. The differ- (∃↔ : {q1, q2} |= q → q ∈ ∆ ) ⇔ q1 ←→ q2. (33) ence between the stall- and normal-flow operation modes h,1 is that an sf -class instruction in a stage s0 ∈ S causes all Next, an instruction can proceed to the next stage iff pipeline stages s ∈ S s.t. s < s0 to be stalled. In other words, none of the above rules is applicable. To model this fact, an instruction stays in a state q = hκ, si ∈ Q h whenever q we use local transitions, building on that we define all global appears in a configuration of the pipeline containing an ear- transitions (used above) to be of a higher probability than the lier instruction in the stall-flow operation mode. Formally, local ones. Further, we add transitions reflecting that once ∀q = hκ, si, q0 = hsf , s0i ∈ Q h : finalized instructions stay in their final state forever. More h 0 h 0 rigorously, ∀hκ, si ∈ Qc : (∃← : {q } |= q → q) ∈ ∆ ⇔ s < s . (29) (hκ, si → hκ, s + 1i) ∈ ∆h , (34) Including stalls caused by stall-flow instructions is neces- sary as they may introduce otherwise unreachable configura- h tions of the verified HS P h . Moreover, since a (hκ, ⊥i → hκ, 1i) ∈ ∆ , (35) caused by some filler instruction may occur at any proces- h sor cycle, we will always allow random transitions between (hκ, max(S)i → hκ, >i) ∈ ∆ , (36) stall- and normal-flow operation modes of filler instructions in the upcoming explanation. (hκ, >i → hκ, >i) ∈ ∆h . (37) Next, an instruction in a state q = hκ, si ∈ Qch , Qch = To ensure a possibility of the pipeline being stalled by K × Sb, Sb = S \{max(S)}, is cancelled, i.e., yields a transi- tion q → hκ0, s + 1i, κ0 ∈ {nf , sf }, provided that q appears some filler instruction, we allow switching between stall- in a configuration of the pipeline in which there exist in- and normal-flow operation modes. More formally, ∀hsf , si, h structions in states q1 and q2 that cause the stage s + 1 to be hnf , si ∈ Qc : h h cleared. More formally, ∀q = hκ, si ∈ Qc , ∀q1, q2 ∈ Q , (hnf , si → hsf , s + 1i) ∈ ∆h , (38) ∀κ0 ∈ {nf , sf }:

0 h h (∃↔ : {q1, q2} |= q → hκ , s + 1i) ∈ ∆ ⇔ (hsf , si → hnf , s + 1i) ∈ ∆ . (39) cl  st  (30) q1 ←−−→ q2 ∧ ¬ q1 ←→ q2 . Finally, we recall that global (i.e., guarded) transitions h,s+1 h,s have a higher priority than local (i.e., unguarded) ones. That Note that for a successful clearing of an instruction in the is, only if no global transition can be applied (such as a stall), stage s, it is also required that s is not stalled at the same a local one may be applied (e.g., proceeding to the next time. stage). Additionally, the transition relation ∆h is constructed For the case when our over-approximating abstraction under the assumption that, in each step of the transition sys- allows two states q and q0 that are conflicting to be reached tem induced by the HS P h , each instruction whose state is in a single configuration of the transition system induced by a part of the given configuration of P h must make a step. h the HS P , we introduce the following solution to reduce This is, if we take, e.g., a configuration q1q2q3 consisting of the number of possible false alarms. Namely, we kill the in- three states of three instructions, all of the three instructions struction that entered the pipeline later assuming that this must synchronously fire some of the above described transi- 0 0 0 instruction is in the state q = hκ, si, i.e., we introduce the tions such that we get the successor configuration q1q2q3. Utilizing Parametric Systems For Detection of Pipeline Hazards 19

e ` h Table 2: Roles of -/ -class instructions in hazards cases. the set B> and that is reachable from the set of initial config- h urations LC: I>. Note, however, that the control states of the Hazard e-class Role `-class Role earlier/later instructions that signify that something relevant for the hazard has happened (some critical value has been RAW writes spoiler (too slow) reads victim written or read) do not necessarily occur at the same time. WAR reads victim writes spoiler (too fast) On the other hand, hazard pairs consist of pairs of states that WAW writes victim writes spoiler (too fast) should be reached at the same time. To resolve this discrep- CTL writes spoiler (too slow) jumps victim ancy, we will pass information that the critical control state of an instruction has been reached to its successor states. For that, we will introduce several auxiliary notions, which will 7.3 Construction of the Minimal Bad Set be introduced such that the detection of the different kinds of hazards may be described in an as uniform way as possible. In the previous section, we have constructed a hazard sys- We first introduce the hazard distance δ that, intuitively, tem P h = (Q h , ∆h , αh ) that models possible interactions determines the maximum delay (measured in pipeline cy- of a spoiler and a victim instruction, forming a hazard case cles) with which the later instruction can still cause a hazard. 2 h = (χsp , χvi ) ∈ H ⊆ X , surrounded by other instruc- Intuitively, the basis of the distance is the difference in the tions during a pipelined execution. We now need to be able number of the stages in which the colliding read/write oper- to check whether some kind of data or control hazard occurs. ations happen within the concerned instructions. However, To facilitate detection of possible hazards from the con- sometimes, this basic difference has to be decreased by one h since one of the colliding operations must appear by at least structed HS, we will construct a set B> of minimal bad con- figurations describing minimal illegal configurations whose one cycle earlier than the other, while in other cases a hazard reachability (within possibly larger configurations) will mean appears even when they occur at the same time. More details that the given hazard case h does indeed lead to a hazard. on that are given below the definition, and an illustration is h h provided in Fig. 3. We define the set B> wrt an extended hazard system P> (defined later in this section), which is obtained by apply- Definition 18 The hazard distance δ : X×X → N is defined ing four transformations, described also later in the section, as follows for all hazards h = (χsp , χvi ) ∈ X × X where on the input system P h . Since the ordering of instructions χk = (πk, τk) for k ∈ {sp, vi}: within a hazard case h ∈ H is an important factor in the  following explanation, we will be speaking about pairs of τ lst − τ fst − 1 if h is a RAW or CTL hazard,  sp vi instruction classes consisting of an e (“earlier”) instruction fst lst δ(h) = τvi − τsp if h is a WAR hazard, and class and an ` (“later”) instruction class such that either e =  τ lst − τ lst − 1 if h is a WAW hazard. sp ∧ ` = vi or e = vi ∧ ` = sp, meaning that an earlier vi sp instruction always enters the pipeline sooner than the later Notice that the hazard distance is indeed always non- one. For the e- and `-class instructions, one of the following negative as the definitions of RAW and CTL hazard cases statements always holds: (a) For RAW and CTL hazards, fst lst (Definitions 11, 14) imply that τvi < τsp , and the defini- the e-class instruction is a spoiler that enters the pipeline tions of WAR and WAW hazards (Definitions 12, 13) imply first and should write data to be read by the later instruction, lst fst that τsp < τvi (and, for the case of WAW hazards, one can but it is too slow and the later victim instruction uses obso- fst lst add the fact that τvi < τvi ). For RAW and CTL hazard lete data. (b) For WAR and WAW hazards, the spoiler sp is cases, the distance is decremented by one because reading an `-class instruction that enters the pipeline later, but it is a value at a cycle when its writing was finished, which is too fast and it either destroys data to be read by the earlier what the corresponding value of τ records (recall that the victim instruction (WAR), or it stores its result too early and writing starts one cycle earlier), is safe. On the other hand, in the result is overwritten by the obsolete result of the earlier WAR hazards, overwriting the value that is read by the ear- instruction (WAW). To formalize the above for later use in lier instruction at the same time is an error. Finally, WAW this section, we define functions E, L: H → K so that E(h) hazards are special in that the conflict arises between two gives the meaning of the e class in a hazard case h ∈ H write operations where the most extreme case arises when while L(h) gives the meaning of the ` class. All these sce- the write operation in the spoiler appears one cycle before narios are summarized in Table 2. the write in the victim: that is why, we have the decrement h We are going to build the set B> such that it will contain by one in the formula of WAW hazards. 1 1 n n so-called hazard pairs qe q` , . . . , qe q` of states of the earlier We will next introduce the so-called spoiler/victim gap and later instruction such that a hazard described by the haz- and detection windows. Intuitively, the gap window gsp /gvi ard case h may occur iff there exists a configuration of the of a spoiler/victim instruction ι will tell us for how many cy- h system P> that contains as a subword some hazard pair from cles one has to wait within the execution of ι, starting from 20 L. Charvat´ · A. Smrckaˇ · T. Vojnar

푑푠푝

1 2 3 4 lst 5 휏푠푝

푔푣푖 푑푣푖 푔푠푝 푑푠푝

1 1 2 3 4 5 1 2 3 4 5 6 7 fst lst lst RAW 휏 휏 휏푠푝 푣푖 푣푖

푑푣푖 훿(ℎ) = 2 1 2 3 4 5 1 1 2 3 4 5 6 7 fst lst fst lst RAW 휏푣푖 휏푣푖 RAW 휏푣푖 휏푣푖

3 1 2 3 4 5 훿(ℎ) = 2 1 fst 2 3 4 5 lst 6 7 휏fst 휏lst RAW 휏푣푖 휏푣푖 OK 푣푖 푣푖

(a) Detection of a RAW hazard using a delay in the spoiler: (b) Detection of a RAW hazard using a delay in the victim: lst fst lst lst fst δ(h) = τsp − τvi − 1 = 4 − 1 − 1 = 2, gsp = τvi − δ(h) = τsp − τvi − 1 = 4 − 1 − 1 = 2, gsp = 0, lst lst lst τsp + 1 = 5 − 4 + 1 = 2, gvi = 0, dsp = δ(h) = 2, and gvi = τsp − τvi − 1 = 4 − 2 − 1 = 1, dsp = δ(h) = 2, dvi = 1. and dvi = 1.

푑푣푖

푑푣푖 1 2 3 4 5 lst 휏푣푖 1 2 3 4 5 6 fst lst 휏 휏푣푖 푣푖 푔푠푝 푑푠푝

푔푠푝 푑푠푝 1 1 2 lst 3 4 5 WAW 휏 1 1 2 lst 3 4 5 6 푠푝 WAR 휏푠푝

훿(ℎ) = 2 1 2 lst 3 4 5 훿(ℎ) = 2 1 2 lst 3 4 5 6 WAW 휏푠푝 WAR 휏푠푝

3 1 2 3 4 5 3 1 2 lst 3 4 5 6 lst OK 휏푠푝 OK 휏푠푝

(c) Detection of a WAR hazard using a delay in the spoiler: (d) Detection of a WAW hazard using a delay in the fst lst lst fst lst lst δ(h) = τvi − τsp = 4 − 2 = 2, gsp = τvi − τvi = spoiler: δ(h) = τvi − τsp − 1 = 5 − 2 − 1 = 2, gsp = 1, 6 − 4 = 2, gvi = 0, dsp = δ(h) = 2, and dvi = 1. gvi = 0, dsp = δ(h) = 2, and dvi = 1.

Fig. 3: An illustration of hazard distances together with gap and detection windows used to construct minimal bad sets. its critical write operation, until the detection of a possible must then be done in such a way that any hazard may be hazard may start. In some cases, the gap will be zero while detected with the detection windows defined as above, i.e., in some other cases it will be positive. The latter case will the detection within the particular instructions must be post- happen when the victim/spoiler instruction ι0, possibly col- poned such that the hazard can always be caught within the liding with ι, has no chance to perform its write operation detection windows. This definition is more complex and is before the moment when the write operation of ι happens given below separately for different types of hazards. even if ι0 starts right after ι. The detection window (of size at least one) will then tell us for how many cycles the detec- Gap Windows for RAW and CTL Hazards tion of a possible hazard should be performed within a given instruction after the gap window passes. First, notice that τ fst < τ lst holds for each forward execution fst lst In particular, we will define all the windows such that the (π, τ) ∈ X where π , π ∈ Vr. Second, recall that the detection window of victim instructions, denoted dvi , will be definitions of RAW and CTL hazard cases (Definitions 11, fst lst fixed to one, i.e., dvi = 1. Intuitively, the hazard detection 14) imply that τvi < τsp . If put together, one can see that fst lst lst will always be performed as soon as the victim instruction there are two possible orderings of τvi , τvi , and τsp : writes (and hence “publishes”) the wrong data and the gap fst lst lst window of that instruction is over. τvi < τsp ≤ τvi (40) The detection window of a spoiler instruction will be fst lst lst τvi < τvi < τsp (41) possibly longer, in particular, it will correspond to the haz- ard distance, i.e., dsp = δ(h) where h = (χsp , χvi ) is the We start with the ordering (40), which is illustrated by considered hazard case. The definition of the gap windows the scenarios in Fig. 3(a). In this case, the spoiler finishes Utilizing Parametric Systems For Detection of Pipeline Hazards 21 its write operation earlier, and the RAW hazard occurs as ter the spoiler instruction writes, the WAW hazard does not soon as the victim performs its write operation. Hence, in occur until the victim performs its write as well. This can- order to be able to detect the hazard via states simultane- not happen sooner than after passing through at least one ously reached in the spoiler and the victim, the detection pipeline stage. Therefore, we put the spoiler gap distance needs to be put off in the spoiler. Provided that the we con- equal to one and the victim gap distance equal to zero, i.e., lst lst sider a victim that starts right after the spoiler, τvi − τsp + 1 gsp = 1 and gvi = 0. cycles need to be skipped in the spoiler (including the cycle in which the write operation of the spoiler happens), and so Tracking Passage through Gap and Detection Windows lst lst 10 gsp = τvi − τsp + 1. On the other hand, no cycles need to be skipped before the detection starts in the victim, and To facilitate tracking whether a spoiler/victim instruction is so gvi = 0. Note that the detection of hazards with victims inside a gap or detection window and, if so, how far inside that start later than one cycle behind the spoiler is handled the window it is, we will introduce a notion of extended haz- through the detection window dsp . ard systems (EHS). In an EHS, each state of the execution of Next, we consider the ordering (41), which is illustrated a spoiler/victim instruction will be labelled by a set of tags in Fig. 3(b). In this case, the victim performs the write op- saying whether the write operation of the spoiler/victim has eration first, and the hazard occurs as soon as the spoiler already happened and, if so, how many cycles have passed performs its write operation. Hence, this time, the detection since then. The universe of tags T will therefore include all needs to be put off in the victim. Using a similar reasoning couples from the set {winsp , winvi } × N. The universe of lst lst 11 as above, we define gsp = 0 and gvi = τsp − τvi − 1. tags is, however, not defined to be equal to the above set since we will need to add some more tags into it later on Gap Windows for WAR Hazards when we examine the effect of stalling of an instruction, which we will need to reflect in the tags as well. We defer For an illustration of the gap and detection windows of WAR the discussion of the stalling-related tags after we properly hazards, see Fig. 3(c). As above, we can use the fact that explain the basic spoiler/victim tags. τ fst < τ lst holds for each forward execution (π, τ) ∈ X Below, we will introduce the EHSs step-wise by first fst lst where π , π ∈ Vr. Moreover, the definition of WAR haz- adding tracking of spoiler windows, then victim windows, lst fst ards (Definition 12) implies that τsp < τvi . Hence, for WAR and then adding tracking of stalled instructions. This will fst lst lst hazards, τvi , τvi , and τsp can be ordered as follows only: lead to introduction of EHSs of various levels, with the zero level being the original hazard system, level one being the τ lst < τ fst < τ lst (42) sp vi vi extension by tracking spoilers, etc. Intuitively, after the spoiler instruction writes, the WAR More formally, for a hazard case h = (χsp , χvi ) and hazard does not occur until the victim performs its write as the associated HS P h = (Q h , ∆h , αh ), the correspond- well. Unlike for RAW/CTL hazards, we now consider as the ing extended hazard system (EHS) of level n ≥ 0 is a tuple h h h h h base case not the situation when the later instruction starts Pn = (Qn , ∆n, αn, βn) where: right after the earlier, but the case when the later instruction 1. Q h is a finite subset of the set Q h × ( ∪ {⊥, >})n.12 starts as late as possible to be still able to cause a hazard, n N We let Q h = Q h , and we give the precise construc- i.e., the case when the spoiler starts δ(h) cycles after the 0 tion of the set Q h for n ≥ 1 below. Intuitively, the ad- victim. Then, it is easy to see that the detection needs to n ditional components of the states will allow us to track be put off by τ lst − (τ lst + δ(h)) cycles. Hence, we define vi sp the passage of the spoiler/victim instructions through the g = τ lst−(τ lst +δ(h)) = τ lst−(τ lst +τ fst−τ lst) = τ lst−τ fst sp vi sp vi sp vi sp vi vi gap and detection windows, for which some states of the while g = 0. The cases of the spoiler that start sooner are vi original HS will need to be split to multiple occurrences then handled appropriately by using the detection window to reflect whether an instruction in that state is in the d = δ(h) as also illustrated in Fig. 3(c). sp window and, if so, how far. Moreover, some further split- ting will be needed when some of the tracked instruc- Gap Windows in WAW Hazards tions are stalled some number of times. The finiteness Q h As with WAR hazards, for WAW hazards, the ordering be- of n will stem from that the tracked gap and detec- tween writes given in Equation 42 is the only possible. Af- tion windows are finite, that we are tracking a pair of instructions, and that the stalling can happen for finite 10 Intuitively, the addition of 1 is needed since the victim starts by time only. one cycle later. Further, note that the gap is appropriately defined also lst lst 12 h for the case when τsp = τvi when a gap window of size 1 is needed to For convenience, by a slight abuse of the notation, we let (Q × compensate the fact that the victim starts by one cycle later. (N∪{⊥, >}))×(N∪{⊥, >}) = Q h ×(N∪{⊥, >})×(N∪{⊥, >}) 11 h The subtraction of 1 comes from that the spoiler starts by one cycle and ((q, i1), i2) = (q, i1, i2) for any q ∈ Q and i1, i2 ∈ N ∪ earlier. {⊥, >}, and likewise for higher values of n. 22 L. Charvat´ · A. Smrckaˇ · T. Vojnar

h h 2. The transition relation ∆n and the labelling function αn Algorithm 2 The window procedure transforming an EHS h h h lift the transition relation ∆ and the labelling function Pn to an EHS Pn+1 to facilitate tracking of the execution h h h α to the extended set of states. We have ∆0 = ∆ and of an instruction that performs a critical write operation w h h α0 = α , and the construction of the relations for n ≥ 1 in a state from some given set S through a window of some is described in detail below. given length k. h h T 3. Finally, β : Q → 2 is the new tag function. We let h h h h h n n Require: An EHS Pn = (Qn, ∆n, αn, βn) of any level n ≥ 0, h h h β0 (q) = ∅ for any q ∈ Q0 . For n ≥ 1, the construction a set S ⊆ Qn of states to start the transformation from, a tag w ∈ of the function will also be shown below. {winsp , winvi }, and the length of the tracking window k ∈ {1, ..., max(S)}. h h h h h h For n ≥ 1, the construction of the EHS Pn will be Ensure: An EHS Pn+1 = (Qn+1, ∆n+1, αn+1, βn+1) where based on applying Alg. 2 and 3 several times on the EHS each state based on q ∈ S together with its k reachable successors h is tagged by a pair (w, i) where 0 ≤ i < k denotes the distance P0 . We start by presenting Alg. 2 that implements a pro- of the successor from the original occurrence of q. cedure denoted as window. Given a set S of starting states, 1: I := {⊥, >, 0, . . . , k − 1}. this procedure extends the input EHS such that it allows for h h 2: Qn+1 := Qn × I. tracking a spoiler/victim instruction, which performs some h 3: ∆n+1 is defined as the minimal relation such that the following critical operation w (such as a write to its target register) in two conditions hold: 0 h a state q ∈ S, and then passes through the tracked gap and (a) For every global transition Q◦ : G |= q1 → q2 ∈ ∆n and h detection windows whose combined length is k. Here, note for every injection Γ : Qn → I, the following transitions are in ∆h : that we monitor the gap and detection windows joint into n+1 one window which is possible since the latter follows im- – Q◦ : Γb(G) |= (q1, ⊥) → (q2, ⊥), mediately after the former (and we can distinguish in which – Q◦ : Γb(G) |= (q1, ⊥) → (q2, 0) if q2 ∈ S, of the original windows we are by just looking at how deep – Q◦ : Γb(G) |= (q1, i) → (q2, i + 1) for all 0 ≤ i < k − 1, into the combined window we are). – Q◦ : Γb(G) |= (q1, i) → (q2, >) for i = k − 1, – : Γ (G) |= (q , >) → (q , >) Intuitively, the algorithm extends all states of the input Q◦ b 1 2 h h Q Qn+1 0 h EHS by one more component that ranges over the set I := where Γb : 2 n → 2 is defined such that ∀Q ⊆ Qn : 0 0 {⊥, >, 0, . . . , k − 1}. When the additional component is ⊥, Γb(Q ) := {(q, Γ (q)) | q ∈ Q }. h the tracked instruction has not yet entered the gap/detection (b) For every local transition q1 → q2 ∈ ∆n, the following tran- sitions are in ∆h : window. If the additional component i is from the set {0,..., n+1 k − 1}, the instruction is in the tracking window for i + 1 – (q1, ⊥) → (q2, ⊥), – (q , ⊥) → (q , 0) if q ∈ S, cycles. If the additional component is >, the instruction has 1 2 2 – (q1, i) → (q2, i + 1) for all 0 ≤ i < k − 1, already got out of the window. – (q1, i) → (q2, >) for i = k − 1, The transition relation is updated straightforwardly such – (q1, >) → (q2, >). that the monitoring phase can be entered whenever an in- h h h 4: ∀(q, i) ∈ Qn × I : αn+1(q, i) = αn(q). struction is in some state from the given set S (and the mon- h h h 5: ∀(q, i) ∈ Qn × {⊥, >} : βn+1(q, i) = βn(q). itoring has not yet started). If the monitoring is started, every h h h 6: ∀(q, i) ∈ Qn × {0, . . . , k − 1} : βn+1((q, i)) = βn(q) ∪ executed transition increases the number of cycles spent in {(w, i)}. the window (recorded in the additional component of states) until the end of the window is reached. Note that, for transi- tions with guards, the states used in the guards must be lifted P h = (Q h to the new set of states, which is done by allowing them to the following auxiliary mapping. Given an EHS n n , h h h Qh ∆ , α , β ) of level n, we define wr h : × → 2 n appear with any value of the additional component. Indeed, n n n Pn K X satisfaction of the guard is not subject to the cycle in which as the function that maps a class κ ∈ K and any execu- h 0 it is reached. tion (π, τ) ∈ X to the set {q ∈ Qn | q = hκ , s, i1,..., 0 lst h The α function does not depend on the additional com- ini ∧ κ = κ ∧ s = τ(π )} of all the states of Pn where lst ponent, and so it is lifted to the new set of states by ignoring a κ-class instruction makes the write to its target register π the additional component. On the other hand, the β func- in the execution (π, τ). From the definition of the execution, lst tion is extended such that states that are inside the monitored we know that such a write occurs in the stage τ(π ). window will be tagged by a couple (w, i), which says that We can now proceed to the transformation of the origi- h h the operation w is in the (i + 1)-th cycle of its gap/detection nal P0 to the EHS P1 that is extended to track the spoiler window. gap and detection windows. With the above notation and al- h Now, before we can apply the window transformations, gorithm in hand, the EHS P1 can be obtained simply as: we need to identify the set S of states where the critical write operations happen and the tracking of the passage of h  h  P := window P , wr h (sp, χsp ), winsp , gsp + dsp . the gap/detection windows starts. Therefore, we introduce 1 0 P0 Utilizing Parametric Systems For Detection of Pipeline Hazards 23

Indeed, the critical operation is writing in a spoiler, which 푛푓1 푛푓2 ... we denote as winsp . The write operation can happen in one 퐺푐푙 of the states returned by wr h (sp, χsp ). These states thus : P0 ∃↔ serve as the initial states for tracking the gap and detection ... 푠푝0 푠푝1 푠푝2 ... windows. Their sizes are gsp and dsp , respectively, which gives the length gsp + dsp of the combined window whose h ∃↔:퐺푠푡 tracking is ensured in P1 by Alg. 2. h (a) A part of an EHS P h = (Q h , ∆h , αh , βh ) modeling the be- The EHS P2 extended to track the victim gap and de- n n n n n h havior of a spoiler instruction before an application of the window tection windows can be obtained from P1 in a very similar procedure. Note that the spoiler instruction in the state sp0 might way as follows: stall (if there are instructions from the set Gst ), be cleared (if there are instructions from the set Gcl ), or proceed to the next stage (repre- h  h  P2 := window P1 , wr P h (vi, χvi ), winvi , gvi + dvi . sented by the state sp1). The different styles of the lines representing 1 global transitions are used to allow for better matching with the coun- An example of a computation of the tracking window is terparts of the transitions in Fig. 4 (b). demonstrated in Fig. 4. 1 ⊤ 푛푓1 푛푓2 ... Tracking Windows in Stalled Instructions

Since our approach builds on counting the exact number of ... 푠푝0 푠푝1 푠푝⊤ ... cycles spent within the tracking windows, we also need to 0 1 2 deal with any scenario when an `-class instruction is stalled while the corresponding e-class instruction is not.13 This Legend: 1 ⊤ scenario breaks the counting scheme introduced in the pre- 푠푝0 푛푓1 ˆ vious paragraphs as the later instruction can get delayed and ∃↔:Γ푖(퐺푠푡) the earlier instruction might get out of the detection window ˆ before the later one gets into its detection window. The goal 푠푝⊤ 푠푝⊤ ∃↔:Γ푖(퐺푐푙) of the following transformations is to compensate such mis- 0 1 alignments by (1) using so-called slack tags to count how h h many times the later instruction gets stalled and (2) by ex- (b) A part of the EHS Pn+1 = window(Pn, {sp0}, winsp , 2) that h j panding the detection window of the earlier instruction cor- correponds to the same part of Pn depicted in Part (a). States spi , 0 ≤ i ≤ 2 0 ≤ j < 2 ( , j) ∈ βh (spj ) respondingly. , , for which winsp n i , are high- lighted in red. Please note that each global transition from the orignal The introduction of slack tags, which are drawn from h EHS Pn corresponds to a family of transitions given by all possible the set {sl} × , is implemented in Alg. 3, which takes us h N injections Γ1,...,Γk : Qn → {⊥, 0, 1, >} with the mappings Γbi, P h 0 h 0 from the EHS 2 obtained by the previous transformations 1 ≤ i ≤ k, defined such that ∀Q ⊆ Qn : Γbi(Q ) := {(q, Γi(q)) | h 0 to EHS P3 as follows: q ∈ Q }. These families of transactions are denoted by the dashed and dotted lines in the figure. In particular, the dashed lines are used h h P3 := slack(P2 , max(S)). for the ∃← : Γci(Gst ) family. The dotted lines then correspond with the ∃← : Γci(Gcl ) family. h Intuitively, all states from the EHS P2 are considered to have the initial slack zero. Then, whenever a self-loop on Fig. 4: An illustration of an application of the window pro- any such state is possible, the self-loop is changed into a tran- cedure on a fragment of an EHS P h . sition going to a new copy of the concerned state with the n slack being one. More generally, a self-loop on a state with the slack being i is transformed into a transition to a new copy of that state with the slack being i+1 (unless the num- access. An example of an application of the slack mapping ber of slack steps reaches the maximum number of pipeline is demonstrated in Fig. 5. stages—going to such a number and beyond is not neces- What remains to be done is to adjust the tracking win- sary since such behaviours are ruled out by the initial sanity dow of the earlier instruction, which has to be done such that checks). The number of stalls (slack transitions) performed the extension corresponds to the number of the slack transi- by an instruction is thus remembered in the structure of the tions taken by the later instruction. For that, we will again states, and, in addition, we add it into the tags of the states use the window procedure from Alg. 2, but we will instruct at the end of Alg. 3 so that the slack information is easier to (i) it to add special tags of the form winvi/sp meaning that the 13 The converse cannot happen due to the basic consistency checks tracking window of the earlier instruction is extended by i that we perform. cycles. The definition of the bad configurations will then 24 L. Charvat´ · A. Smrckaˇ · T. Vojnar

Algorithm 3 A procedure for computing the slack mapping. ... 푠푝0 푠푝1 푠푝2 푠푝3 ... h h h h h Require: An EHS Pn = (Qn, ∆n, αn, βn) of any level n ≥ 0 and m ≥ 1 the total number of pipeline stages . ′ h h h h h ∃↔:퐺푠푡 ∃↔:퐺푠푡 Ensure: An EHS Pn+1 = (Qn+1, ∆n+1, αn+1, βn+1) whose h (a) A part of an EHS P h modeling the behavior of a spoiler in- states Pn+1 are tagged by pairs (sl, i) where 0 ≤ i < m de- n notes the number of self-loop transitions taken by the later tracked struction before an application of the slack mapping. Note that the instruction in the EHS P h . spoiler instruction in the states sp1 and sp2 might be stalled (if there n 0 1: I := {>, 0, . . . , m − 1}. are instructions from the set Gst , resp. Gst ). The different styles of the lines representing global transitions are used to allow for better 2: Q h := Q h × I. n+1 n matching with the counterparts of the transitions in Fig. 5 (b). h 3: ∆n+1 is defined as the minimal relation such that the following two conditions hold: 푠푝0 푠푝0 푠푝0 푠푝0 h ... 0 1 2 3 ... (a) For every global transition Q◦ : G |= q1 → q2 ∈ ∆n and h for every injection Γ : Qn → I, the following transitions are in h ∆n+1: Legend: 푠푝1 푠푝1 푠푝1 ... – Q◦ : Γb(G) |= (q1, i) → (q2, i + 1) if q1 = q2 for all 1 2 3 0 ≤ i < m − 1, ˆ ∃↔:Γ푖(퐺푠푡) – Q◦ : Γb(G) |= (q1, i) → (q2, i) if q1 6= q2 for all 0 ≤ i < m, ⊤ ⊤ ⊤ ˆ 푠푝 푠푝 푠푝 ... ∃↔:Γ (퐺′ ) 1 2 3 – Q◦ : Γb(G) |= (q1, i) → (q2, >) if q1 = q2 and i = m − 1, 푖 푠푡 – Q◦ : Γb(G) |= (q1, >) → (q2, >) h h Qh Qh 0 h (b) A part of an EHS Pn+1 = slack(Pn, 2) that correponds to where Γb : 2 n → 2 n+1 is defined such that ∀Q ⊆ Q : n the same part of P h depicted in Part (a). States spj , 0 ≤ i ≤ 3, Γb(Q 0) := {(q, Γ (q)) | q ∈ Q 0}. n i 0 ≤ j < 2, for which (sl, j) ∈ βh (spj ) with the same value of h n i (b) For every local transition q1 → q2 ∈ ∆n, the following tran- j, indicating that the instructions passed the same number of self- h sitions are in ∆n+1: loops, share the same color. Please note that each global transition h – (q1, i) → (q2, i + 1) if q1 = q2 for all 0 ≤ i < m − 1, from the orignal EHS Pn corresponds to a family of transitions given h – (q1, i) → (q2, i) if q1 6= q2 for all 0 ≤ i < m, by all possible injections Γ1,...,Γk : Qn → {⊥, 0, 1, >} with 0 h – (q1, i) → (q2, >) if q1 = q2 and i = m − 1, the mappings Γbi, 1 ≤ i ≤ k, defined such that ∀Q ⊆ Qn : (q , >) → (q , >) 0 0 – 1 2 . Γbi(Q ) := {(q, Γi(q)) | q ∈ Q }. Analogically to Fig. 4, these h h h families of transactions are denoted by the dashed and dotted lines in 4: ∀(q, i) ∈ Qn × I : αn+1(q, i) = αn(q). the figure—the dashed lines are used for the ∃← : Γi(Gst ) family h h h c 5: ∀(q, i) ∈ Qn × {>} : βn+1(q, i) = βn(q). 0 while the dotted lines correspond with the ∃← : Γci(Gst ) family. h h h 6: ∀(q, i) ∈ Qn × {0, . . . , m − 1} : βn+1((q, i)) = βn(q) ∪ {(sl, i)}. Fig. 5: An illustration of an application of the slack mapping h on a fragment of an EHS Pn . (i) match states of the earlier instruction tagged by winvi/sp with winsp/vi -tagged states of the later instruction that are rations whose reachability (within possibly larger configu- at the same time tagged by such sl tags which show that the rations) will mean that the given hazard case h does indeed later instruction went through i slack transitions more than h lead to a hazard. It now remains to define the set B> along the earlier one. with the corresponding set of initial configurations between With all the notation at hand, it is now easy to derive the which reachability will have to be checked. h EHSs P3+i of levels 3 + i for 1 ≤ i ≤ m with m = max(S) h We first define the regular set I> of initial configurations being the maximum number of pipeline stages that extend h of P> that consists solely of instructions in the state ⊥, i.e., the tracking window of the earlier instruction by i cycles. before entering the pipeline. An initial configuration may For i iterating from 1 to m, we get be of an arbitrary length, and it may contain exactly one h  h spoiler sp and one victim instruction vi, interleaved by any P3+i := window P3+i−1, wr P h (κ, χκ), 3+i−1 other instructions in any order, modelled using the nf class. (i)  Formally, the set I h of the initial states of EHS P h is defined winκ , gκ + dκ + i > > as follows h h I h := I h ∪ I h where κ = E(h). As the final step, we put P> := P3+m. > 1 2 where Initial and Bad Configurations h ∗ ∗ ∗ I1 := {hnf , ⊥i} {hvi, ⊥i}{hnf , ⊥i} {hsp, ⊥i}{hnf , ⊥i} h Above, we have finished the construction of the EHS P> h and designed to facilitate the construction of the set B> of min- h ∗ ∗ ∗ imal bad configurations describing minimal illegal configu- I2 := {hnf , ⊥i} {hsp, ⊥i}{hnf , ⊥i} {hvi, ⊥i}{hnf , ⊥i} . Utilizing Parametric Systems For Detection of Pipeline Hazards 25

h h h Next, we define the set B> of minimal bad configura- With the EHS P> and the sets of initial I> and minimal h tions that describe hazardous configurations. The main chal- bad configurations B> at hand, checking whether the haz- h lenge behind the construction of B> is to correctly match de- ard h is feasible reduces to checking whether there is some h tection states of the earlier and later instructions. For that, we configuration in B> that is reachable from some configura- h will use the tracking mechanism that we have provided by tion in I>, for which one can use techniques described, e.g., h the winsp/vi tags. Namely, we will construct B> to include in [2,4]. h all configurations that contain pairs of states qe , q` ∈ Q>, consisting of a state qe of an e-class instruction and a state q` of an `-class instruction, for which the win tags corre- 8 Experimental Evaluation spond to the detection part of the tracking window. This is, h h we will consider the states qe = hκe ,...i ∈ Q> obeying We have implemented the above described method in a pro- totype tool called Hades [10]. Hades is written in C++ com- h β (qe ) ∈ {(win h , i) | g h ≤ i < g h + d h }, > κe κe κe κe bined with Python and consists of several components de- picted in Fig. 6. The tool first reads an RTL description of h h h for κe = E(h), and, dually, the states q` = hκ` ,...i ∈ Q> the processor to be verified and converts it into its internal satisfying PSG representation. Currently, Hades supports the RTL for-

h mat expressed in CodAL which is an architectural descrip- β (q`) ∈ {(winκh , i) | gκh ≤ i < gκh + dκh }, > ` ` ` ` tion language used in the processor design IDE [15]. For other RTL languages like VHDL and Verilog where archi- h for κ` = L(h). tectural registers are not explicitly identified, a list of archi- It now remains to deal with situations when some of the tectural registers with an explicit identification of the pro- instructions are stalled. This is monitored using the sl tags. gram counter must be provided. First, we can observe that we do not have to further elaborate The obtained PSG representation is then normalised and cases when both (earlier and later) instructions are stalled simplified. This step includes, for instance, a replacement together. Clearly, any hazard that would occur after these of conditional branching by multiplexors, an application of cases would also occur in the case when the instructions are value propagation, and a removal of redundant nodes and not stalled. Second, the case when the earlier instruction is edges. The normalisation is done using an internal compo- stalled while the later is not is excluded by the consistency nent of Hades called as the RTL query engine (RQE), which of the pipeline. Therefore, it suffices to only consider those allows one to search for data-paths and substitute parts of states of the earlier instruction qe for which (sl, 0) ∈ β(qe ). the microprocessor RTL design described via a PSG. Subse- Next, let i be a counter that increases each time the later quently, pipeline stages are identified by the data-flow anal- instruction is stalled while the earlier one is not. Since the ysis discussed in Section 5.1. Next, pipeline consistency is consistency Rules 1–4 from Section 5.2 guarantee that each checked using Rules 1–4 from Section 5.2 by an SMT solver instruction leaves the pipeline in a final number of steps, the for bit-vector logic. Hades is compatible with all SMT solvers value of the counter i may only range from 0 to max(S). accepting the SMT2 formula format. In particular, for the Every time the counter i is increased, the detection in the below experiments, recent versions of (v4.8.8) [21] and earlier instruction is postponed by a single pipeline cycle. CVC4 (v1.9) [3] were used. Further, after the PSG is anno- h Taken all together, the set B> of minimal bad configura- tated by pipeline stages identified by the data-flow analysis, tions describing hazardous configurations is defined as Hades repeatedly utilizes the RQE and the SMT solver to ex- tract potential hazard cases as described in Section 6 and to max( ) [S generate the appropriate hazard systems (HSs) for each haz- B h := B h (43) > i ard case as we have seen in Section 7. The generated HSs i=0 are then checked using the abstract regular model checker where (ARMC) of [4]. The process of evaluation of the inputs and generation of the results by the above mentioned subsystems h (i) h Bi := {qe q` | {(sl, 0), (win h , i + j)} ⊆ β>(qe ) ∧ is orchestrated by the so-called “core” component of Hades. κe h We have tested the tool on the following processors14: {(sl, i), (winκh , k)} ⊆ β>(q`) ∧ ` TinyCPU is a small 8-bit processor with 3 pipeline stages gκh ≤ j < gκh + dκh ∧ e e e (44) that we use mainly for testing new verification methods. gκh ≤ k < gκh + dκh ∧ ` ` ` 14 Most of the processors are available together with our tool. How- h h h qe = hκe ,...i ∈ Q> ∧ κe = E(h) ∧ ever, SPP8 and Codea2 were provided by the Codasip company [15] h h h and are not publicly available at the time of writing this article— q` = hκ` ,...i ∈ Q> ∧ κ` = L(h)}. likewise for SPP16, which we derived from SPP8. 26 L. Charvat´ · A. Smrckaˇ · T. Vojnar

Table 3: Experimental results obtained using the Z3 SMT solver on processors whose names are given in the table.

Name (Stages) Simpl. Data Flow Consistency Parametric System Total Hazard Variant Time [s] Analysis [s] Checking [s] Generation and Verification [s] Time [s] Cases [#] rqe smt core rqe smt armc core pot. real TinyCPU (3) SCR 0.02 0.01 <0.01 0.54 0.30 0.01 1.11 12.23 2.29 16.51 6 0 SACRWV 0.03 0.01 <0.01 0.63 0.36 0.03 3.29 32.36 6.35 43.06 10 1 BCR 0.01 0.01 <0.01 0.50 0.26 0.01 1.09 10.20 2.54 14.62 6 0 BACRWV 0.01 0.01 <0.01 0.66 0.37 0.03 3.48 32.35 8.07 44.98 10 1 SFCR 0.02 0.01 <0.01 0.51 0.32 0.02 2.52 31.16 4.70 39.26 14 0 SFACRWV 0.02 0.01 <0.01 0.68 0.46 0.05 6.50 81.16 13.27 102.15 22 1 SPP (3) SCR 0.07 0.01 0.01 0.83 0.59 0.04 5.24 32.67 9.73 49.19 29 0 BCR 0.06 0.01 0.01 0.82 0.60 0.04 5.72 32.68 14.16 54.10 29 0 SPP16 (3) SCR 0.07 0.01 0.01 0.94 0.63 0.04 5.51 32.39 9.93 49.53 29 0 BCR 0.07 0.01 0.01 0.90 0.64 0.04 6.03 32.86 14.53 55.09 29 0 Codea2 (4) SFCR 0.19 0.04 0.01 1.19 0.67 0.33 74.96 247.17 128.48 453.04 243 4 CompAcc (4) SFACRWV 0.06 0.01 0.01 1.14 0.72 0.13 28.36 248.78 36.52 315.73 44 0 BFACRWV 0.07 0.02 0.01 1.14 0.67 0.16 34.97 249.45 49.99 336.48 59 0 DLX5 (5) SFCR 0.10 0.03 0.01 2.18 1.32 0.11 24.03 189.67 42.58 260.03 27 1 SFACRWV 0.10 0.03 0.01 2.36 1.42 0.20 46.82 292.63 81.04 424.61 62 2 BFCR 0.10 0.04 0.01 2.29 1.31 0.13 57.79 190.03 143.21 394.91 27 1 BFACRWV 0.12 0.02 0.01 2.34 1.43 0.23 91.20 292.95 361.14 749.44 62 2

S Stalling Logic B Bypassing Logic F Flag Register(s) A Auto-increment Logic C CTL Hazards R RAW Hazards W WAR Hazards V WAW Hazards

RTL Stage PSG + list of PSG cial registers, a flag register, and an instruction set including identification with stages architectural storages 41 instructions where each may use up to 4 available ad- dressing modes. CompAcc is an 8-bit processor with 4 stages SMT Consistency RTL ARMC solver checks query engine that is based on an accumulator architecture with a very sim- ilar structure as the one shown in Fig. 2. Finally, DLX5 is HADES Potential Parameterized core hazard cases systems a 5-staged 32-bit processor able to execute a subset of the instruction set of the DLX architecture [24] (with no float- ing point instructions). Fig. 6: A schematic of the Hades verification tool. We consider multiple variants of the above introduced processors, which gives us 17 unique test cases in total. In SPP8 is a 3-staged 8-bit ipcore with 16 general-purpose reg- particular, the variants of the particular processors differ in isters and a RISC instruction set consisting of 9 instructions. the following aspects: (i) the way how data hazards are avoid- SPP16 is a 16-bit variant of the previous processor with ed (pipeline stalling and clearing or data bypassing), (ii) the a more complex memory model (allowing, for instance, un- presence of flag/status registers, and (iii) utilization of the aligned access). Codea2 is a 16-bit processor with 4 pipeline auto-increment logic. stages dedicated for signal processing applications. The pro- We conducted a series of experiments on a PC with Intel cessor is equipped with 16 general-purpose registers, 15 spe- Core i7-3770K @3.50GHz and 32 GB RAM whose results Utilizing Parametric Systems For Detection of Pipeline Hazards 27 are shown in Table 3. In these experiments, we used Hades Finally, the two sub-columns of the last column give backed by the Z3 solver. The first column gives the name (i) the number of potential data and control hazard cases of the verified processor and its variant. The specifics of that had to be checked and (ii) the number of cases that each variant typically influence the types of potential haz- were proven to be real, respectively. Note that each hazard ards that may occur. Therefore, we have also encoded the case represents a separate task, and so the generation and hazard types into the variant’s name (e.g., “C” for “contains verification of the parametric systems can be parallelized potential control hazards”—we give a full list of the codes in the future. Even though the number of hazard cases is used below the table). The second and third columns give higher when compared to our previous work [12] (because the time needed for the PSG simplification and its data flow of control hazards), the runtimes (in Table 3) have improved analysis. The fourth and fifth columns (both split to a num- for most of the columns. A majority of this improvement is ber of sub-columns) then give the duration of the consis- caused by an implementation of a mechanism that efficiently tency checking and the time spent by verification of the para- pre-computes and reuses the formulae F1, F2, and F3 from metric systems that are created for each hazard case. In the Eq. 12. A drawback of this approach (compared to the one sub-columns, the times given in the fourth and fifth columns used in [12]) is a higher utilization of the SMT solver, which are split to the times consumed by the different parts of our is, however, compensated by the lower verification time con- tool’s architecture. sumed by the Hades core component. The sixth column provides the overall verification time, During the experiments, we identified a flaw in the RAW which remains in the order of minutes even for complex de- hazard resolution when accessing the data memory in a de- signs. Moreover, the tool also scales well with the growing velopment version of the SPP8 processor. Further, compared width of the processor data-path as can be seen by compar- to our previous results from [12], we have extended our ing the times obtained for SPP8 and SPP16. TinyCPU and DLX5 processor models so that they can now handle a majority of control hazards. The different number From the obtained data, we further see that the verifi- of potential hazard cases is a side-effect of this change. All cation time grows with the number of stages of the proces- the remaining real hazard cases (shown in the last column) sor. Naturally, such an increase comes from the fact that, are control hazards. in CPUs with deeper pipelines, longer spoiler/victim execu- Notice that each processor with the auto-increment logic, tions are likely to occur. We have not analysed this growth which typically resides in early pipeline stages, contains at from a complexity-theoretic point of view, but our experi- least one instance of a control hazard. The auto-increment ments show that the average verification time required for logic is a feature of addressing modes where a register used verification of a single hazard case approximately doubles for computing a memory address is incremented once its with each additional stage—it is 2.86 sec. for the considered value is obtained. Such an increment is not guarded even if processors with 3 pipeline stages, 4.91 sec. for the 4-staged the instruction itself will be discarded in later stages, e.g., ones, and 10.8 sec. for the 5-staged ones15. due to an unsuccessful . This is not Since the current Hades implementation relies on ex- considered a design error; instead, software compilers are porting formulas in the smt2 file format, the tool does not instructed not to place instructions with an auto-increment depend on any particular SMT solver. We took this oppor- near jumping instructions (they resolve the problem by ex- tunity and conducted the same set of experiments with the plicitly generating a series of NOP instructions after a condi- CVC4 solver too where we observed a similar pattern in tional branch). A similar situation occurs in the DLX proces- which the average verification time per hazard case dou- sor where the fourth pipeline stage, designated for memory bles with each added stage. We have also noticed that for accesses, writes to memory whenever a store instruction is certain tested designs (especially the less complex ones like executed even if the branch condition of an earlier instruc- TinyCPU) the CVC4 solver delivers results approximately tion is not yet evaluated (therefore the memory access can- 5-10 % faster than Z3. Therefore, utilization of a portfo- not be prevented). This situation adds an extra occurrence of lio solver, which would run multiple SMT solvers in par- a control hazard and must also be taken into account when allel and report the answer that is first returned by any one compiling a program. Finally, in Codea2, the logic that deals of them, should lead to even better verification times. How- with control hazards is left out intentionally (by design), ever, we do not consider extensive benchmarking with SMT again relying on the compiler to take this into account. solvers a goal of this article.

9 Conclusion 15 The considered processors differ from each other not only in the length of the pipeline but also in the overall structural complexity. We have presented an approach that harnesses methods for However, the overall structural complexity is bigger for those pro- cessors that have a longer pipeline. Hence, the influence of solely the formal verification of parametric systems in order to dis- length of the pipeline can be even smaller than two-fold. cover incorrectly handled data and control pipeline hazards 28 L. Charvat´ · A. Smrckaˇ · T. Vojnar in the RTL implementation of pipeline-based execution. The 14. CodAL architecture description language. www.codasip. approach was developed with the aim to be highly auto- com/custom-processor, 2019. www. mated, not requiring any additional efforts from the devel- 15. Codasip Studio for rapid processor development. codasip.com, 2019. opers (apart from specifying the architectural registers). We 16. K. Hao, S. Ray, and F. Xie. Equivalence checking for function have implemented the approach and successfully tested it pipelining in behavioral synthesis. In Proc. of Design, Automation on several non-trivial microprocessors where the approach and Test in Europe (DATE’14), pages 1–6. IEEE, 2014. was able to discover previously unknown flaws caused by 17. R. B. Jones, C. H. Seger, and D. L. Dill. Self-consistency check- ing. In Proc. of Formal Methods in Computer-Aided Design (FM- unhandled hazards. CAD’96), volume 1166 of LNCS, pages 159–171. Springer, 1996. 18. A. Koelbl, R. Jacoby, H. Jain, and C. Pixley. Solver technology for system-level to RTL equivalence checking. In Proc. of Design, Acknowledgements This work was supported by the Czech Science Automation and Test in Europe (DATE’09), pages 196–201. IEEE, Foundation under the project 20-07487S. 2009. 19. U. Kuhne, S. Beyer, J. Bormann, and J. Barstow. Automated formal verification of processors based on architectural models. References In Proc. of Formal Methods in Computer-Aided Design (FM- CAD’10), pages 129–136. IEEE, 2010. 1. M. D. Aagaard. A hazards-based correctness statement for 20. P. Mishra, H. Tomiyama, N. Dutt, and A. Nicolau. Automatic veri- pipelined circuits. In Proc. of Correct Hardware Design and Ver- fication of in-order execution in microprocessors with fragmented ification Methods (CHARME’03), volume 2860 of LNCS, pages pipelines and multicycle functional units. In Proc. of Design, 66–80. Springer, 2003. Automation and Test in Europe (DATE’02), pages 36–43. IEEE, 2. P. A. Abdulla, F. Haziza., and L. Hol´ık. All for the price of few 2002. (parameterized verification through view abstraction). In Proc. of 21. L. De Moura and N. Bjorner. Z3: An efficient SMT solver. In Verification, Model Checking, and Abstract Interpretation (VM- Proc. of International Conference on Tools and Algorithms for the CAI’13), volume 7737 of LNCS, pages 476–495. Springer, 2013. Construction and Analysis of Systems (TACAS’08), volume 4963 3. Clark Barrett, Christopher L. Conway, Morgan Deters, Liana of LNCS, pages 337–340. Springer, 2008. Hadarean, Dejan Jovanovi’c, Tim King, Andrew Reynolds, and 22. K. S. Namjoshi. Symmetry and completeness in the analysis of Cesare Tinelli. CVC4. In Proceedings of the 23rd International parameterized systems. In Proc. of Verification, Model Checking, Conference on Computer Aided Verification (CAV’11), volume and Abstract Interpretation (VMCAI’07), volume 4349 of LNCS, 6806 of LNCS, pages 171–177. Springer, 2011. pages 299–313. Springer, 2007. 4. A. Bouajjani, P. Habermehl, and T. Vojnar. Abstract regular model 23. M. Ngyuen, M. Thalmaier, M. Wedler, J. Bormann, D. Stoffel, checking. In Proc. of 16th International Conference on Computer and W. Kunz. Unbounded protocol compliance verification usign Aided Verification (CAV’04), volume 3114 of LNCS, pages 197– interval property checking with invariants. IEEE Transactions on 202. Springer, 2004. Computer-Aided Design of Integrated Circuits, 27(11), 2008. 5. R. Brummayer and A. Biere. Boolector: An efficient SMT solver 24. D. A. Patterson and J. L. Hennessy. Computer Organization and for bit-vectors and arrays. In Proc. of International Conference on Design: The Hardware / Software Interface. Morgan Kaufmann, Tools and Algorithms for the Construction and Analysis of Systems Boston, fourth edition, 2012. (TACAS’09), volume 5505 of LNCS, pages 174–177. Springer, 25. J. Van Praet, D. Lanneer, W. Geurts, and G. Goossens. nML: A 2009. Structural Processor Modeling Language for Retargetable Com- 6. Randal E. Bryant. Formal verification of pipelined Y86-64 mi- pilation and ASIP Design, volume 1 of Systems on Silicon, pages croprocessors with UCLID5. Technical Report CMU-CS-18-122, 65–93. Morgan Kaufmann, Burlington, 2008. 2018. 26. Synopsys. ASIP Designer: Design Tool for Application Specific 7. J. R. Burch and D. L. Dill. Automatic verification of pipelined Instruction-Set Processors, Designer Datasheet, 2018. microprocessor control. In Proc. of Computer Aided Verification 27. M. N. Velev and P. Gao. Automatic formal verification of mul- (CAV’94), volume 818 of LNCS, pages 68–80. Springer, 1994. tithreaded pipelined microprocessors. In Proc. of International 8. Cadence. Tensilica Software Development Toolkit (SDK), 2014. Conference on Computer Aided Design (ICCAD’11), pages 679– 9. L. Charvat,´ A. Smrcka,ˇ and T. Vojnar. Automatic formal corre- 686. IEEE, 2011. spondence checking of ISA and RTL microprocessor description. In Proc. of Microprocessor Test and Verification (MTV’12), pages 6–12. IEEE, 2012. 10. L. Charvat,´ A. Smrcka,ˇ and T. Vojnar. HADES Hades Hard- ware Verification Tool. www.fit.vutbr.cz/research/ groups/verifit/tools/hades/, 2014. 11. L. Charvat,´ A. Smrcka,ˇ and T. Vojnar. Using formal verification of parameterized systems in RAW hazard analysis in microproces- sors. In Proc. of Microprocessor Test and Verification (MTV’14), pages 83–89. IEEE, 2014. 12. L. Charvat,´ A. Smrcka,ˇ and T. Vojnar. HADES: Microprocessor hazard analysis via formal verification of parameterized systems. In Proc. of 11th Doctoral Workshop on Mathematical and Engi- neering Methods in Computer Science (MEMICS’16), 233, pages 87–93. EPTCS, 2016. 13. E. Clarke, M. Talupur, and H. Veith. Environment abstraction for parameterized verification. In Proc. of Verification, Model Checking, and Abstract Interpretation (VMCAI’06), volume 3855 of LNCS, pages 126–141. Springer, 2006.