Complexity and Correctness of a Super-Pipelined Processor

Dissertation

zur Erlangung des Grades des Doktors der Ingenieurwissenschaften der Naturwissenschaftlich-Technischen Fakultäten der Universität des Saarlandes

Jochen Preiß [email protected]

Saarbrücken, April 2005

Tag des Kolloquiums: 29. April 2005
Dekan: Prof. Dr. Jörg Eschmeier

Vorsitzender des Prüfungsausschusses: Prof. Dr. Gerhard Weikum
Erstgutachter: Prof. Dr. Wolfgang J. Paul
Zweitgutachter: Priv.-Doz. Dr. Silvia M. Müller
Akademischer Beisitzer: Dr. Sven Beyer

Out Out!! You demons of stupidity!! – Dogbert

Danke

Diesen Abschnitt möchte ich all denen widmen, die zum Gelingen dieser Arbeit beigetragen haben.

Zuallererst möchte ich meinen Eltern dafür danken, dass sie mich in allen Phasen meiner Ausbildung unterstützt haben und mir gerade in schwierigeren Zeiten stets ein Rückhalt waren.

Mein ganz besonderer Dank gilt auch Herrn Prof. Paul für die Vergabe dieses interessanten und herausfordernden Themas und für die wissenschaftliche Unterstützung.

Bei Christian Jacobi möchte ich mich für das Korrekturlesen dieser Arbeit und die vielen Vorschläge, die mir geholfen haben, diese Arbeit immer weiter zu verbessern, bedanken.

Meinen Freunden von der Uni Werner Backes, Christoph Berg, Sven Beyer, Mark Hillebrand, Thomas In der Rieden, Michael Klein und Dirk Leinenbach möchte ich danken für die fruchtbaren Diskussionen und das gute Klima, gefördert durch Tischfußball, Skat- und Doppelkopf-Abende und das Verschonen meines Rasens.

Meinen Kollegen bei IBM, insbesondere Cédric Lichtenau und Thomas Pflüger, gilt mein Dank für ihr Verständnis und die Unterstützung gerade in der letzten Phase meiner Dissertation.

Abstract

This thesis introduces the DLXπ+, a super-pipelined processor with variable cycle time. The cycle time of the DLXπ+ may be as low as 9 gate delays (including 5 gate delays for registers), which is assumed to be a lower bound for the cycle time. For the parts of the DLXπ+ that significantly differ from previous implementations, correctness proofs are provided. Formulas are developed which compute restrictions on the parameters of the DLXπ+, e.g., the maximum number of reservation station entries for a given cycle time. The formulas also compute which modifications to the base design have to be made in order to realize a certain cycle time and what the impact is on the number of pipeline stages. This lays the foundation for computing, in future work, the time per instruction of the DLXπ+ for a given benchmark and different cycle times in order to determine the “optimum” cycle time.

Kurzzusammenfassung

In dieser Arbeit wird die DLXπ+ eingeführt, ein super-gepipelineter Prozessor mit variabler Zykluszeit. Die Zykluszeit der DLXπ+ kann bis auf 9 Gatter-Delays (inklusive 5 Gatter-Delays für Register) reduziert werden, was als untere Schranke für die Zykluszeit angesehen wird. Für die Teile der DLXπ+, die sich signifikant von bisherigen Implementierungen unterscheiden, werden Korrektheits-Beweise geliefert. Desweiteren werden Formeln entwickelt, die Beschränkungen für die Parameter der DLXπ+ wie zum Beispiel die maximale Anzahl von Reservation-Station-Einträgen für eine gegebene Zykluszeit berechnen. Die Formeln errechnen außerdem, welche Modifikationen am Basis-Design notwendig sind, um eine bestimmte Zykluszeit zu erreichen, und welchen Einfluss dies auf die Anzahl der Pipeline-Stufen hat. Damit wird die Grundlage gelegt, um als zukünftige Arbeit die benötigte Zeit pro Instruktion der DLXπ+ für einen gegebenen Benchmark bei verschiedenen Zykluszeiten zu berechnen und damit die „optimale“ Zykluszeit zu bestimmen.

Extended Abstract

In order to increase the performance of a processor regarding a specific benchmark one can either decrease the cycle time of the processor or the CPI (cycles per instruction) that the processor needs for the benchmark. Usual ways to decrease the CPI are, e.g., pipelining, out-of-order execution, branch prediction, or super-scalar designs. This thesis focuses on the cycle time. The cycle time of a processor can be improved by increasing the number of pipeline stages of the processor and therefore decreasing the amount of work to be done in each stage. This is called super-pipelining. Note that super-pipelining may increase the CPI for several reasons. Due to the increased number of pipeline stages, data dependencies may have a larger impact. Also, the smaller amount of logic that fits into one cycle may have a negative impact on the micro-architecture, e.g., reduce the maximum number of possible reservation station entries. This may increase the frequency of stalls and therefore increase the CPI. Thus, the minimum cycle time may not be the optimal cycle time for a design and a given benchmark.

This thesis introduces the DLXπ+, a super-pipelined processor with variable cycle time, i.e., with a variable number of pipeline stages. For the computation of cycle time and cost of the DLXπ+ the technology-independent gate model from [MP00] is used. The cycle time of the DLXπ+ may be as low as 9 gate delays (including 5 gate delays for the registers). For comparison, a 16 bit addition (which has 12 combinational gate delays in the used model) needs less than half a cycle in the deeply pipelined Pentium 4 processor [HSU+01], but needs 3 cycles in the DLXπ+ with 9 gate delays cycle time. The variant of the DLXπ+ with 9 gate delays cycle time is more a proof of concept than a design expected to have good performance. Therefore, the main part of this thesis only handles cycle times of at least 10 in order to simplify the design. A variant of the DLXπ+ with a cycle time smaller than 9 is not assumed to be possible, although no formal proof for this is provided. In this thesis formulas are developed which compute restrictions on the parameters of the DLXπ+, e.g., the maximum number of reservation station entries for a given cycle time. Other formulas compute which modifications to the base design have to be made in order to realize a certain cycle time and what the impact is on the number of pipeline stages. This lays the foundation for writing, in future work, a cycle-accurate DLXπ+ simulator that computes the performance of the DLXπ+ for a given benchmark and different cycle times. Using this simulator the optimum cycle time of the DLXπ+ for the benchmark could be determined.

The DLXπ+ is an out-of-order processor that uses the Tomasulo scheduler [Tom67]. The design is based on the work of Kröning [Krö99]. The instruction set architecture (ISA) is taken from the MIPS R3000 processor [KH92] with small modifications simplifying the adaptation of the design. This allows a simulation of the DLXπ+ with MIPS R3000 instruction traces [Hil95] of the SPEC92 benchmark [SPEC]. In order to realize the small cycle times, parts of the DLXπ+ differ significantly from the design presented by Kröning. In particular, new stalling and forwarding techniques are used. If these techniques are used, it is, for example, no longer obvious that a RAM access returns the correct result. Therefore, correctness proofs are provided for the critical parts of the DLXπ+.

Zusammenfassung

Um die Leistung eines Prozessors bezüglich eines spezifischen Benchmarks zu verbessern, kann man entweder die Zykluszeit oder die CPI (benötigte Anzahl von Zyklen pro Instruktion) des Prozessors reduzieren. Bekannte Methoden, die CPI zu reduzieren, sind zum Beispiel Pipelining, Out-of-order Execution, Branch Prediction oder superskalare Designs. In dieser Arbeit geht es hingegen in erster Linie um die Reduzierung der Zykluszeit. Die Zykluszeit eines Prozessors kann durch Erhöhen der Anzahl von Pipeline-Stufen des Prozessors und damit durch Reduzierung der Arbeit, die in jeder dieser Stufen verrichtet werden muss, verbessert werden. Dies wird Super-Pipelining genannt. Man beachte, dass Super-Pipelining die CPI aus verschiedenen Gründen erhöhen kann. Auf Grund der größeren Anzahl von Pipeline-Stufen können Daten-Abhängigkeiten einen größeren Einfluss haben. Außerdem kann die geringe Menge von Logik, die in einen Zyklus passt, negative Auswirkungen auf die Mikro-Architektur des Prozessors haben, indem zum Beispiel die maximal mögliche Anzahl von Reservation-Station-Einträgen reduziert wird. Dies kann die Häufigkeit von Stalls und damit die CPI erhöhen. Daher muss die minimale Zykluszeit nicht notwendigerweise die optimale Zykluszeit für ein Design und einen gegebenen Benchmark sein.

In dieser Arbeit wird die DLXπ+ eingeführt, ein super-gepipelineter Prozessor mit variabler Zykluszeit, das heißt mit variabler Anzahl von Pipeline-Stufen. Zur Berechnung von Zykluszeit und Kosten der DLXπ+ wird das von der Technologie unabhängige Gatter-Modell aus [MP00] verwendet. Die Zykluszeit der DLXπ+ kann bis auf 9 Gatter-Delays (inklusive 5 Gatter-Delays für die Register) reduziert werden. Zum Vergleich: die Berechnung einer 16-Bit-Addition (die in dem benutzten Gatter-Modell 12 Gatter-Delays benötigt) braucht weniger als einen halben Takt im tief gepipelineten Pentium 4 Prozessor [HSU+01], aber 3 Takte in der DLXπ+ mit 9 Gatter-Delays Zykluszeit. Die Variante der DLXπ+ mit 9 Gatter-Delays Zykluszeit dient nur als Machbarkeits-Beweis. Es wird nicht erwartet, dass sie eine gute Leistung erreicht. Deshalb betrachtet der Hauptteil dieser Arbeit nur Zykluszeiten von mindestens 10, um das Design zu vereinfachen. Eine Variante der DLXπ+ mit einer Zykluszeit von weniger als 9 wird nicht als möglich erachtet, auch wenn kein formaler Beweis dafür gegeben wird. In dieser Arbeit werden Formeln entwickelt, die Beschränkungen für die Parameter der DLXπ+ wie zum Beispiel die maximale Anzahl von Reservation-Station-Einträgen in Abhängigkeit von der Zykluszeit berechnen. Andere Formeln errechnen, welche Modifikationen am Basis-Design notwendig sind, um eine bestimmte Zykluszeit zu erreichen, und welchen Einfluss dies auf die Anzahl der Pipeline-Stufen hat. Damit wird die Grundlage gelegt, um als zukünftige Arbeit einen Zyklus-genauen DLXπ+-Simulator zu schreiben, der die Leistung der DLXπ+ für einen gegebenen Benchmark und verschiedene Zykluszeiten berechnet. Mit diesem Simulator wäre es möglich, die optimale Zykluszeit der DLXπ+ für den Benchmark zu bestimmen.

Die DLXπ+ ist ein out-of-order Prozessor, der den Tomasulo-Scheduler [Tom67] benutzt. Das Design basiert auf der Arbeit von Kröning [Krö99]. Der Instruktions-Satz wurde mit kleinen Änderungen, die die Anpassung des Designs erleichtern, vom MIPS R3000 Prozessor [KH92] übernommen. Dadurch ist es möglich, die DLXπ+ mit Hilfe von MIPS R3000 Traces [Hil95] des SPEC92 Benchmarks [SPEC] zu simulieren. Um die geringe Zykluszeit zu erreichen, müssen Teile der DLXπ+ gegenüber dem Design von Kröning signifikant verändert werden. Insbesondere müssen neue Techniken zum Stallen und Forwarden eingeführt werden. Durch den Einsatz dieser Techniken ist es zum Beispiel nicht mehr offensichtlich, dass ein RAM-Zugriff die korrekten Daten liefert. Deshalb werden für die kritischen Teile der DLXπ+ Korrektheits-Beweise geführt.

Contents

1 Introduction 1
1.1 Outline ...... 3

2 Basics 5
2.1 Notation ...... 5
2.2 Cost and Delay Model ...... 6
2.3 Basic Circuits ...... 7
2.4 Encodings ...... 8
2.5 Pipelining ...... 9
2.5.1 Stages ...... 9
2.5.2 Computation of Stall Signals ...... 11
2.5.3 Optimization of the Stall Computation ...... 13
2.5.4 Maximum Delay of Stall Inputs ...... 17
2.6 Pipelining of RAM Blocks ...... 18
2.6.1 Forwarding ...... 19
2.6.2 Forwarding with Stalling ...... 21
2.6.3 Pipelining of the Forwarding Circuits ...... 21
2.6.4 Cost and Delay ...... 25

3 Tomasulo Algorithm 27
3.1 Overview ...... 27
3.2 Basic Data Structures ...... 28
3.2.1 Functional Units ...... 28
3.2.2 Register Files and Producer Tables ...... 28
3.2.3 Reservation Stations ...... 29
3.2.4 Common Data Bus ...... 29
3.2.5 Reorder Buffer ...... 29
3.3 Instruction Execution ...... 29
3.3.1 Decode ...... 29
3.3.2 Dispatch ...... 31
3.3.3 Execute ...... 31
3.3.4 Completion ...... 31
3.3.5 Retire ...... 31

4 Processor Core 33
4.1 Decode ...... 33
4.1.1 Overview ...... 33
4.1.2 Operands ...... 35
4.1.3 Instruction Decoding Circuit ...... 36
4.1.4 Operand Generation ...... 37
4.1.5 Destination Computation ...... 38
4.1.6 Instruction Issue ...... 39
4.1.7 Stalling ...... 45
4.1.8 Cost and Delay ...... 47
4.2 Dispatch ...... 51
4.2.1 Entries ...... 52
4.2.2 Reservation Station Control ...... 55
4.2.3 Pipelining ...... 58
4.3 Functional Units ...... 63
4.4 Completion ...... 64
4.4.1 Arbiter ...... 64
4.4.2 Pipelining ...... 66
4.4.3 Cost and Delay ...... 68
4.5 Retire ...... 70
4.5.1 Overview ...... 70
4.5.2 Tag Check ...... 72
4.5.3 Interrupt Handling ...... 72
4.5.4 Cost and Delay ...... 75
4.6 Reorder Buffer Environment ...... 76
4.6.1 Overview ...... 76
4.6.2 Pipelining of the Retiring-Context ...... 78
4.6.3 Forwarding ...... 79
4.6.4 Implementation of Forwarding ...... 81
4.6.5 Control ...... 82
4.6.6 Correctness ...... 87
4.6.7 Delay Optimizations ...... 90
4.6.8 Cost and Delay ...... 93
4.7 Environment ...... 94
4.7.1 Forwarding ...... 95
4.7.2 General Purpose Register File ...... 98
4.7.3 Floating Point Register File ...... 99
4.7.4 Special Purpose Register File ...... 100
4.7.5 Cost and Delay ...... 103
4.8 Producer Table Environment ...... 104
4.8.1 Forwarding ...... 104
4.8.2 Cost and Delay ...... 107

5 Memory Unit 109
5.1 Overview ...... 109
5.2 Overview of the Data Cache ...... 110
5.2.1 Execution of Memory Accesses ...... 111
5.2.2 Cache Core and Main Memory ...... 112
5.2.3 Speculation ...... 113
5.3 Hit Computation ...... 113
5.3.1 Overview of the Hit Signal Computation ...... 114
5.3.2 Local Hit Signals ...... 116
5.3.3 Static Hit Signals ...... 117
5.3.4 Global Hit Signals ...... 119
5.3.5 Actions ...... 120
5.3.6 Stall Computation ...... 122
5.3.7 Cost and Delay ...... 122
5.4 Cache Core ...... 126
5.5 Update Queue ...... 127
5.5.1 Entries ...... 128
5.5.2 Control ...... 130
5.5.3 Delay Optimizations ...... 133
5.5.4 Optimized Completion for Store Instructions ...... 136
5.5.5 Cost and Delay ...... 137
5.6 Read Queue ...... 140
5.6.1 Cost and Delay ...... 141
5.7 Stall Computation ...... 143
5.8 Cost and Delay ...... 143

6 Instruction Fetch 149
6.1 Instruction Fetch Mechanism ...... 149
6.1.1 Overview ...... 149
6.1.2 Clocking of the Instruction Fetch ...... 150
6.1.3 Branch Prediction ...... 150
6.2 Instruction Fetch Unit ...... 151
6.2.1 Overview ...... 151
6.2.2 Instruction Cache ...... 153
6.2.3 Computation of the Next Fetch-PC ...... 155
6.2.4 Instruction Fetch Control ...... 158
6.2.5 Cost and Delay ...... 159
6.3 Instruction Fetch Queue ...... 159
6.3.1 IFQ Entries ...... 160
6.3.2 Control ...... 161
6.3.3 Cost and Delay ...... 161
6.4 Environment ...... 162
6.5 Branch Checking Unit ...... 162
6.5.1 Stall Computation ...... 165
6.5.2 Cost and Delay ...... 165
6.6 Processor Flush ...... 166

7 Discussion 169
7.1 Stage depths below 5 ...... 169
7.2 Gate Model ...... 173
7.3 Overall Cost and Delay ...... 174
7.4 Related Work ...... 178

8 Summary 181
8.1 Future Work ...... 181

A Instruction set architecture 183
A.1 Instructions ...... 183
A.2 Encoding ...... 184

B Emulation of a MIPS R3000 189

C Additional Circuits 191
C.1 Basic Circuits ...... 191
C.1.1 Design ...... 191
C.1.2 Cost and Delay ...... 191
C.2 Instruction Decode ...... 193
C.2.1 Decode ...... 194
C.2.2 Destination computation ...... 196

D Functional Units 199
D.1 Integer ALU ...... 199
D.2 Integer Multiplicative Unit ...... 201
D.3 Floating Point Units ...... 203
D.4 Memory Unit ...... 205
D.4.1 Shift for Store ...... 205
D.4.2 Shift for Load ...... 207

E Cost and Delay 209

Chapter 1

Introduction

Over the past fifty years the performance of processors has dramatically increased. The advances in lithography allowed the building of constantly smaller transistors. This made the transistors faster and also increased the number of available transistors. A larger number of transistors made it possible to additionally increase the performance of the processor by implementing more advanced and complex designs. The performance of a processor regarding a specific benchmark can be measured in TPI, the average time per instruction. The TPI can be computed as the product of the cycle time of the processor and the CPI (cycles per instruction). To increase the performance of a processor one can either decrease the cycle time or the CPI. Known techniques that may decrease the CPI are, e.g., pipelining, out-of-order execution, branch prediction, or super-scalar designs. This thesis focuses on improving the cycle time of the processor. If technology improvements are neglected and the total work for processing an instruction is not changed, a lower cycle time can only be achieved by increasing the number of pipeline stages of the processor. Increasing the number of pipeline stages over the 5 stages of a simple pipelined processor, e.g., the MIPS R3000 [KH92], is called super-pipelining. However, extensive super-pipelining in order to minimize the cycle time does not necessarily maximize the TPI, since it may have a negative impact on the CPI. An increased number of pipeline stages increases the number of cycles needed for “critical” loops, e.g., the execution of an ALU instruction and the forwarding of its result to the following instructions, or the resolving of a branch misprediction. Thus, the penalty for data dependencies or mispredicted branches becomes higher. This leads to stall conditions occurring more often, which increases the CPI. Note that a constant part of the cycle time is consumed by the register delay. Only the remaining part is available for useful work.
Splitting the cycle time in half therefore reduces the useful work per cycle by more than half, so the number of cycles needed for a computation can be more than doubled. Hence, the gain due to the lower cycle time may be lower than the loss due to the increased CPI. Additionally, the decreased logic depth that fits into a cycle may have a negative impact on the micro-architecture of the processor. For example, if a register file access must be pipelined, forwarding of the write ports must be implemented, which increases the combinational delay of the register file access. Also, certain parameters of the micro-architecture such as the number of reservation station entries may be bounded by the cycle time. Hence, for small cycle times it might be necessary to reduce the number of reservation station entries, which could increase the CPI. Numerous other examples can be found throughout this thesis. In order to investigate the side-effects and the limits of super-pipelining, this thesis introduces the DLXπ+, a super-pipelined processor with a variable cycle time. The cycle time and the cost of the DLXπ+ are computed using the technology-independent gate model from [MP00]. In addition to the cycle time, the DLXπ+ supports other variable parameters, e.g., cache size, number of functional units, or number of reservation station entries. The minimum cycle time of the DLXπ+ is only 9 gate delays (including 5 gate delays for the registers, thus leaving 4 gate delays for useful work). Note that the deeply pipelined Pentium 4 processor can compute a 16 bit addition (which has 12 combinational gate delays in our model) in less than half a cycle [HSU+01]. Hence, based on our model the amount of useful work that can be done in one cycle of the Pentium 4 is at least six times higher than the 4 gate delays in one cycle of the DLXπ+ with minimum cycle time.
Even though the delay model may not be accurate as it neglects wire delay and fanout, the error is probably much less than a factor of six. The minimum cycle time of the DLXπ+ is therefore assumed to be far smaller than the cycle time of the Pentium 4 processor. Some critical circuits of the DLXπ+ need two levels of logic which together have 4 gate delays and hence use up all the useful work that can be done with minimum cycle time. Although no formal proof is provided, the author therefore assumes that it is not possible to build a DLXπ+ with a cycle time below 9 without sacrificing, e.g., a best-case CPI of 1. On the other hand, several trade-offs needed to be made in order to realize the DLXπ+ with 9 gate delays cycle time that can significantly increase the CPI for realistic benchmarks. Therefore, for simplicity, the main part of this thesis only treats cycle times of at least 10. For cycle times of 10 and above this thesis develops formulas that define the behavior of the DLXπ+. Dependent on the cycle time these formulas compute the necessary modifications, the number of pipeline stages of the different parts of the design, and the cost of the processor. Additionally, formulas are developed that compute restrictions on the parameters of the DLXπ+ for a given cycle time. Using these formulas one can write a cycle-accurate DLXπ+ simulator that computes the TPI of the DLXπ+ for a given benchmark and different cycle times. Hence, one can determine the “optimum” cycle time giving maximum overall performance for the benchmark. This part is future work.
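The TPI trade-off described above can be illustrated with a toy calculation. All cycle times and CPI values below are invented for illustration only; the sole relation taken from the text is TPI = cycle time × CPI.

```python
# Hypothetical illustration of the TPI trade-off: a shorter cycle
# (deeper pipeline) often comes with a higher CPI, so the minimum
# cycle time need not minimize TPI. All numbers are made up.

def tpi(cycle_time, cpi):
    """Time per instruction as the product of cycle time and CPI."""
    return cycle_time * cpi

design_points = [
    (20, 1.2),  # shallow pipeline: long cycle, low CPI
    (12, 1.6),  # deeper pipeline
    (9,  2.4),  # minimum cycle time, but frequent stalls
]

# Pick the design point with the smallest TPI:
best = min(design_points, key=lambda p: tpi(*p))
for ct, cpi in design_points:
    print(f"cycle time {ct:2d}, CPI {cpi}: TPI = {tpi(ct, cpi):.1f}")
print("best cycle time:", best[0])  # 12 here -- not the minimum cycle time
```

With these invented numbers the middle design point wins, mirroring the argument that the optimum cycle time for a benchmark may lie above the minimum achievable cycle time.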

The design of the DLXπ+ is based on Kröning's out-of-order variant [Krö99] of the DLX [HP96] implementation by Müller and Paul [MP00]. The instruction set architecture (ISA) is taken from the MIPS R3000 processor [KH92] with small modifications simplifying the adaptation of the design. In contrast to the design of Müller and Paul, the DLXπ+ supports integer multiplications and divisions and does not use a delayed PC. The choice of the MIPS R3000 ISA allows for an accurate simulation of the DLXπ+ using MIPS R3000 instruction traces [Hil95] of the SPEC92 benchmark [SPEC]. A complete listing of the DLXπ+ instruction set can be found in appendix A.

The DLX implementation by Kröning is used as the starting point for the DLXπ+. However, most circuits of Kröning's design have been redesigned. The changes either decreased the combinational delay of the design or were necessary to allow small cycle times. In particular, an instruction fetch mechanism using branch prediction had to be introduced.

1.1 Outline

This thesis is structured as follows: the basics needed for the design of the DLXπ+ are presented in chapter 2. Chapter 3 describes the Tomasulo algorithm that is used by the DLXπ+ in order to execute instructions out-of-order. The design of the processor core of the DLXπ+ with a cycle time of 10 and above is presented in chapters 4 to 6. Chapter 4 details the core of the DLXπ+, chapters 5 and 6 detail the design of the memory unit and the instruction fetch. The results of this thesis, including the modifications necessary for the DLXπ+ with a cycle time of 9, are discussed in chapter 7. A summary is given in chapter 8.

Chapter 2

Basics

In this chapter basic concepts used in this thesis are discussed. The notation and naming conventions are summarized in section 2.1. Section 2.2 introduces the gate model used to compute cost and delay of circuits. The basic circuits used in this thesis are presented in section 2.3. Section 2.4 introduces the half-unary encoding used throughout this thesis. Sections 2.5 and 2.6 discuss pipelining of circuits and RAM blocks. These techniques are essential for the design of the DLXπ+ processor, and the general discussion simplifies the description in the later chapters.

2.1 Notation

In this thesis the following naming conventions will be used for circuits and signals:

• Circuit names are written in sans serif.

• Register and signal names are written in italic.

• The output signal of a register reg is also denoted by reg. The data input signal is denoted by reg′.

• A bus bus with indexes from j to i is denoted by bus[j : i].

• Signals and busses can be combined to a multi-bus. A signal sig of a multi-bus mbus is denoted by mbus.sig. The whole multi-bus is denoted by mbus.⋆.

• The outputs of a circuit Circ are often combined to the multi-bus Circ.⋆.

• The concatenation of two signals sig1 and sig2 is denoted by {sig1, sig2}. If the signals belong to the same multi-bus mbus the notation mbus.{sig1, sig2} is used.

• If multiple signals are differentiated by an index (e.g., sig_0 to sig_n), then sig⋆ denotes all signals of this kind. For readability, the usage of ⋆ to denote all signals of a multi-bus or all indexes may be used imprecisely if it is clear from the context which signals are meant. If the context differentiates the signals mbus.a and mbus.⋆, then mbus.⋆ means all signals of the multi-bus except the signal mbus.a.

2.2 Cost and Delay Model

To compare the performance and the cost (i.e., the area) of different processor designs, a simple gate model based on the gate model in [MP95] is used. All gates and registers have constant delay. Fanout is not taken into account. The cost and delay of the gates are summarized in table 2.1. A register has a delay of 4 for the outputs and additionally a setup time of 1 for the inputs, resulting in an overall delay of 5.

       INV  NAND  NOR  AND  OR  MUX  XOR  XNOR  REG
cost    1    2     2    2    2   3    4    4     8
delay   1    1     1    2    2   2    2    2    4+1

Table 2.1: Cost and delay of gates

The delay of a combinational path in a circuit is defined as the sum of the delays of the gates on the path. The delay of a signal is the maximum delay of all combinational paths from a register to the signal (excluding the delay of the register). The delay of a circuit is the delay of the longest combinational path from an input or a register inside the circuit to an output or a register inside the circuit. The following notations are used for cost and delay of gates, signals, and circuits:

• For a gate GATE or a register REG, the delays are denoted by D_GATE and D_REG. The costs are denoted by C_GATE and C_REG.

• The delay of the signal sig is denoted by D(sig).

• D(sig1, sig2) denotes the maximum delay of two signals sig1 and sig2.

• For signals sig1 and sig2, D(sig1 → sig2) denotes the maximum delay of all combinational paths from sig1 to sig2.

• For a circuit Circ, D(Circ) denotes the delay and C(Circ) denotes the cost of the circuit.

Table 2.2: Gate symbols for INV, NAND, NOR, AND, OR, MUX, XOR, XNOR, and REG (the graphical symbols are not reproduced in this text extraction)

The symbols used for gates in this thesis are shown in table 2.2. Inverted lines are indicated by small circles at the input or output of gates or circuits (see e.g. the NOR gate). All registers have a clock enable signal. The clock enable signal is connected to the triangle shape of the register symbol. If no signal is connected to the triangle shape, the clock enable is tied to one, i.e., the register is always clocked. Registers are assumed to deliver both the negated and the non-negated value. Thus, it is possible to replace any AND or OR gate on the critical path by NAND and NOR gates using de Morgan's law. For the same reason all inverters on the critical path can be removed. If a signal is used in multiple critical paths, it may be necessary to compute both the negated and the non-negated value. For the sake of readability, AND and OR gates will be used in the designs, but with the reduced delay to reflect a design with NAND and NOR gates. The revised delay of inverters, AND, and OR gates is summarized in table 2.3.

       INV  AND  OR
delay   0    1    1

Table 2.3: Revised delay of inverters, AND-, and OR-gates
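A minimal sketch of how path delays are computed in this model. The gate-list representation and the function name are ad hoc; the delay values follow tables 2.1 and 2.3 (revised AND/OR/INV delays), and fanout is ignored as stated above.

```python
# Delay of a combinational path = sum of the gate delays along it.
# Delays per tables 2.1/2.3: INV is free after the NAND/NOR rewrite,
# NAND/NOR cost 1, MUX/XOR/XNOR cost 2.
GATE_DELAY = {
    "INV": 0, "NAND": 1, "NOR": 1, "AND": 1, "OR": 1,
    "MUX": 2, "XOR": 2, "XNOR": 2,
}

def path_delay(gates):
    """Delay of a combinational path given as a list of gate names."""
    return sum(GATE_DELAY[g] for g in gates)

# e.g., a path through two multiplexers and an AND gate:
print(path_delay(["MUX", "MUX", "AND"]))  # 2 + 2 + 1 = 5
```

The delay of a signal would then be the maximum of `path_delay` over all register-to-signal paths, as defined in the text.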

A RAM block with A lines, D data bits, and a single read/write port is denoted by RAM(A, D). Cost and delay of such a RAM block can be computed by the following formulas from [MP95]:

C(RAM(A, D)) = 3 · (A + 3) · (D + ⌈log log D⌉),

D(RAM(A, D)) = ⌈log D⌉ + ⌈A/4⌉      if A ≤ 64,
               3 · ⌈log A⌉ + 10     if A > 64.

RAM blocks may have multiple read and write ports. The write ports are numbered from 1 to w. If multiple write accesses have the same target address, the write ports with smaller index have higher priority. A RAM block with r read ports and w write ports is denoted by RAM(A,D,r,w). Cost and delay of this RAM block are based on cost and delay of a simple RAM block. The formulas are taken from [Krö99]:

C(RAM(A,D,r,w)) = C(RAM(A, D)) · (0.4 + 0.3 · (r + 2w)), D(RAM(A,D,r,w)) = D(RAM(A, D)) · (0.5 + 0.25 · (r + 2w)).

At higher frequencies it is not possible to access a RAM block in a single cycle. A RAM block which needs c cycles for every access is denoted by RAM(A,D,r,w,c). The additional registers increase the cost of the RAM block by 10% per cycle.

C(RAM(A,D,r,w,c)) = C(RAM(A,D,r,w)) · (0.9 + 0.1 · c).

The design of pipelined RAM blocks that take multiple cycles for accesses is detailed in section 2.6. The given cost does not include any additional circuits needed for forwarding between the write and the read ports (see section 2.6.1).
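The cost and delay formulas of this section can be transcribed directly into executable form. This is a sketch under two assumptions not stated explicitly above: `log` is read as the base-2 logarithm, and D ≥ 2 so that `log log D` is defined; the function names are ad hoc.

```python
import math

def c_ram(A, D):
    """Cost of a single-ported RAM(A, D) block (formula from [MP95]); D >= 2."""
    return 3 * (A + 3) * (D + math.ceil(math.log2(math.log2(D))))

def d_ram(A, D):
    """Delay of a single-ported RAM(A, D) block (piecewise in A)."""
    if A <= 64:
        return math.ceil(math.log2(D)) + math.ceil(A / 4)
    return 3 * math.ceil(math.log2(A)) + 10

def c_ram_ports(A, D, r, w):
    """Cost of RAM(A, D, r, w) with r read and w write ports [Krö99]."""
    return c_ram(A, D) * (0.4 + 0.3 * (r + 2 * w))

def d_ram_ports(A, D, r, w):
    """Delay of RAM(A, D, r, w)."""
    return d_ram(A, D) * (0.5 + 0.25 * (r + 2 * w))

def c_ram_pipelined(A, D, r, w, c):
    """Cost of RAM(A, D, r, w, c): +10% per access cycle."""
    return c_ram_ports(A, D, r, w) * (0.9 + 0.1 * c)

print(d_ram(64, 32))   # ceil(log2 32) + ceil(64/4) = 5 + 16 = 21
print(d_ram(128, 32))  # 3 * ceil(log2 128) + 10 = 31
```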

2.3 Basic Circuits

Basic circuits such as decoders, encoders, etc., which are used in this thesis, are not discussed in detail. For cost and delay of the basic circuits and the design of a half-unary find-last-one circuit see appendix C.1. The symbols for the basic circuits are shown in table 2.4.

Dec / Enc        Decoder / Encoder
Sel              Select circuit (multiplexer with unary select signals)
FFO / FLO        Find-First-One circuit / Find-Last-One circuit
HFLO             Half-unary Find-Last-One circuit
ADD / INC        Adder / Incrementer
PP-OR / PP-AND   Parallel-Prefix-OR / Parallel-Prefix-AND
LS / RS          Left-Shifter / Right-Shifter
CLS / CRS        Cyclic-Left-Shifter / Cyclic-Right-Shifter
EQ / =k?         Equality Checker / Test against constant k
                 AND-Tree / OR-Tree
Circ             Tree of associative circuit Circ

Table 2.4: Basic Circuits

2.4 Encodings

The binary encoding with n bits of a number i with 0 ≤ i < 2^n is denoted by (i)bin(n). If the width of the encoding is clear from the context, it can be omitted, i.e., the encoding can be denoted by (i)bin. The number represented by a binary encoding v of length n is denoted by ⟨v⟩. Thus:

⟨v⟩ = Σ_{j=0}^{n−1} v[j] · 2^j.
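As a sanity check of the notation, the value ⟨v⟩ can be computed directly from the sum above. This is a sketch with an ad hoc function name; bit vectors are written as Python lists with v[0] holding bit 0.

```python
def binary_value(v):
    """<v>: the number represented by the bit vector v (v[0] = bit 0)."""
    return sum(bit << j for j, bit in enumerate(v))

# (13)_bin(4) = 1101, stored low bit first:
print(binary_value([1, 0, 1, 1]))  # 13
```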

In multiple parts of the design of the DLXπ+, unary respectively half-unary encodings are used to represent numbers. The unary or half-unary encoding of length n of a number i is denoted by (i)un(n) respectively (i)hun(n). It is defined by:

(i)un(n) := {0^(n−i−1), 1, 0^i},
(i)hun(n) := {0^(n−i−1), 1^(i+1)}.

The value represented by a vector v in unary or half-unary encoding is denoted by ⟨v⟩un respectively ⟨v⟩hun. Thus, assuming v is a valid encoding, it holds:

⟨v⟩un = j if v[j] = 1,

⟨v⟩hun = max{j | v[j] = 1}.

Figure 2.1: Stage (inputs: fullIn, dataIn, clear, stall; registers: full, data; outputs: fullOut, dataOut; the graphical figure is not reproduced)

For unary and half-unary encodings, incrementers and decrementers can be imple- mented by one bit shifters, which have a constant delay of DMUX :

(i + 1)un = (i)un[n − 2:0], 0, (i − 1)un = 0, (i)un[n − 1:1],

(i + 1)hun = (i)un[n − 2:0], 1, (i − 1)hun = 0, (i)un[n − 1:1].
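The shift formulas above translate directly into list operations. This is a sketch using a low-bit-first list representation (index 0 = least significant position), so "append a bit at the bottom of the concatenation" becomes "prepend to the list"; function names are ad hoc.

```python
def unary_inc(v):
    """(i+1)_un = {(i)_un[n-2:0], 0}: shift the single 1 to a higher index."""
    return [0] + v[:-1]

def unary_dec(v):
    """(i-1)_un = {0, (i)_un[n-1:1]}: shift the single 1 to a lower index."""
    return v[1:] + [0]

def half_unary_inc(v):
    """(i+1)_hun = {(i)_hun[n-2:0], 1}: the newly inserted bottom bit is a 1."""
    return [1] + v[:-1]

def half_unary_dec(v):
    """(i-1)_hun = {0, (i)_hun[n-1:1]}."""
    return v[1:] + [0]

# (2)_un(5)  = [0,0,1,0,0]  ->  (3)_un(5)  = [0,0,0,1,0]
print(unary_inc([0, 0, 1, 0, 0]))
# (2)_hun(5) = [1,1,1,0,0]  ->  (3)_hun(5) = [1,1,1,1,0]
print(half_unary_inc([1, 1, 1, 0, 0]))
```

In hardware each of these is a one-bit shift, i.e., a row of multiplexers with constant delay, which is what makes these encodings attractive at small cycle times.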

Unary and half-unary encodings additionally allows the comparison of the represented value against a given constant with zero delay since this information can be directly derived from the signal with the corresponding index. Unary encodings are usually used if it must be checked whether the value is equal to the constant, half-unary en- codings are used if it must be checked whether the value is larger than a constant. For a vector v in unary respectively half-unary encoding and a number j holds:

(v)un = j iff v[j] = 1,

(v)hun ≥ j iff v[j] = 1

Note that in half-unary encoding the bit 0 is always one. Thus, it can often be removed to reduce the size of the vector. In this case the value 0 is represented by all bits being zero.
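The encodings and the zero-delay comparisons of this section can be sketched as follows (function names are ad hoc; lists are low-bit-first, index 0 = bit 0):

```python
def unary(i, n):
    """(i)_un(n): a single 1 at position i."""
    return [1 if j == i else 0 for j in range(n)]

def half_unary(i, n):
    """(i)_hun(n): ones at positions 0..i, zeros above."""
    return [1 if j <= i else 0 for j in range(n)]

def unary_value(v):
    """<v>_un: the position of the single 1."""
    return v.index(1)

def half_unary_value(v):
    """<v>_hun = max{j | v[j] = 1}."""
    return max(j for j, bit in enumerate(v) if bit)

# The zero-delay comparisons: a single bit test answers
# "value == j" (unary) and "value >= j" (half-unary).
v = unary(3, 6)
w = half_unary(3, 6)
assert (v[3] == 1) == (unary_value(v) == 3)
assert (w[2] == 1) == (half_unary_value(w) >= 2)
print(v, w)
```

The single-bit tests are what the text means by "zero delay": no comparator tree is needed, the answer is already a wire of the encoded vector.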

2.5 Pipelining

2.5.1 Stages

The circuits of the processor are divided into stages. A stage is a combinational circuit with a set of input registers (see figure 2.1). The delay of the combinational circuit is called the combinational delay of the stage. Usually a stage has an explicit full register indicating whether the stage contains valid information. A stage is called full if the full bit is set. The clear signal clear resets the full bit, thus invalidating the content of the stage. The stall signal stall is active if the registers of the stage may not be updated. A sequence of numbered stages, where the outputs of stage i are used as inputs of stage i + 1, is called a pipeline [Kog81] (see figure 2.2). If the signal name sig is used in multiple stages, the index of the stage is added to the signal name (sig_i) to distinguish the signals. The combinational delay of a pipeline is the sum of the combinational delays of the stages. Usually all stages of a pipeline have a common clear signal.

Figure 2.2: Pipeline with n stages

If the clear signal is not active, the flow of information through the pipeline is steered by the stall signals. Assume the stage i is full. If the stage i + 1 is stalled (i.e., the signal stall_{i+1} is active), the information in stage i cannot proceed to the next stage. Then the stage i must also be stalled, because otherwise the information in stage i would be overwritten. If the stage i + 1 is not stalled, but the stage i is stalled (e.g., due to a cache miss), the output full bit of stage i, which is the input full bit of stage i + 1, must be invalidated. Otherwise the information of stage i would be duplicated. If a stage is not full it does not have to be stalled, as no information could be overwritten or duplicated.

The stall engine of [Krö01] also computes additional update enable signals that control the update of the data registers. The update enable signal of a stage i is activated if the signal stall_i is not active and the full bit of the stage i − 1 is set. Hence, the stage i is only updated if valid information flows from stage i − 1 to i. However, this is not necessary for correctness, since the content of the data registers may be arbitrary if the full bit is not set. Therefore, the update enable signals are omitted in this thesis.

In some circuits, parts of the registers of a stage have to be updated even if the stage is stalled. These registers are not directly controlled by the stall signal, but the new value of the registers often depends on the stall signal. Note that the delay of the stall signal may be large and may increase the combinational delay of the stage.

In order to combine multiple pipelines, a pipeline has two additional stall signals stallIn and stallOut. The input signal stallIn must be active if the pipeline may not output data on its data output dataOut. The output signal stallOut is active if the first stage of the pipeline cannot accept new data on its data input dataIn, i.e., stallOut = stall_0.
Consider a stage with a combinational delay D. The input registers have a delay of D_REG = 4. Due to the setup time of the registers (which is 1), the stage bounds the cycle time τ to be at least D + 5. To allow cycle times smaller than D + 5, the stage can be replaced by a pipeline of multiple stages that computes the same outputs. This is done by splitting the combinational circuit into parts and adding registers which store the intermediate results, called "pipelining the stage". If a certain cycle time τ has to be reached, the pipelining of the circuit must be done such that the combinational delay of each stage is at most δ := τ − 5. This maximum value δ for the combinational delay of the stages is called the stage depth. In this thesis δ is considered instead of the cycle time τ to reflect the frequency of the processor.

The transformation of a circuit into a pipeline of s stages changes the number of cycles needed to compute the result. For many circuits presented in this thesis the value of s is not relevant for the correctness of the processor. For example, it does not matter if a floating point computation is divided into 2 or 5 stages. It can then be chosen such that the pipeline adheres to the maximum stage depth δ. In this case this thesis only describes the combinational circuit.
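The relation between cycle time, stage depth, and the number of pipeline stages can be sketched numerically (an illustrative calculation assuming the gate model above, with register delay 4 and setup time 1; the helper names are made up):

```python
import math

D_REG, SETUP = 4, 1  # register delay and setup time of the gate model

def stage_depth(tau):
    """Stage depth delta = tau - 5 for a target cycle time tau."""
    return tau - (D_REG + SETUP)

def min_stages(D, tau):
    """Minimum number of stages so each part has delay <= delta."""
    return math.ceil(D / stage_depth(tau))

# A circuit with combinational delay 30 at cycle time tau = 11 (delta = 6)
# needs ceil(30/6) = 5 stages:
assert stage_depth(11) == 6
assert min_stages(30, 11) == 5
```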

The transformation of a stage into a pipeline with c stages increases the cost of the circuit (mainly) by the cost of the staging registers. Computing the exact number of staging registers is usually difficult and needs to be done for each c separately, because it largely depends on the width of the intermediate results. In this thesis the additional cost is only approximated by:

(c − 1) · ⌈(I + O)/2⌉ · C_REG, (2.1)

where I is the number of inputs and O is the number of outputs of the combinational circuit. This includes all additional hardware of the pipelining, including the stall computation and the buffer circuits (see the following section).

2.5.2 Computation of Stall Signals

A stage i of a pipeline can be stalled for two reasons. The stage i can generate the stall itself, e.g., a cache stage might generate a stall due to a detected cache miss. This is indicated by the signal genStall_i. If the stage i + 1 is stalled, the stage i must also be stalled, as the information in stage i cannot proceed to the stage i + 1. Both cases can be ignored if stage i is not full (full_i = 0). In this case the registers of the stage do not contain valid information and the stage can therefore receive new data. To summarize, the stall signal of a stage is computed as:

stall_i = full_i ∧ (genStall_i ∨ stall_{i+1}).

Similar to pipelines, two signals stallIn_i and stallOut_i are defined for every stage i. The signal stallIn_i corresponds to the signal stall_{i+1}; stallOut_i corresponds to stall_i. A pipeline stage including the computation of the stall signal is shown in figure 2.3. The dashed line can be ignored for now. The full output fullOut is only set to zero if the signal genStall is active, in contrast to figure 2.1, where the stall signal is used to reset fullOut. This simplification can be made as the information in the stage cannot be duplicated if genStall is not active. Otherwise the stage can only be stalled if the signal stallIn is active. In this case the succeeding stage is stalled too, will thus not be updated, and hence the succeeding stage ignores the full output.
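The stall equation above can be evaluated cycle by cycle with a short sketch (illustrative only, not from the thesis) that propagates the stalls backwards from the last stage to the first:

```python
def compute_stalls(full, gen_stall, stall_in):
    """stall_i = full_i AND (genStall_i OR stall_{i+1}) for c stages.

    stall_in plays the role of stall_c, the stall input of the pipeline.
    """
    c = len(full)
    stall = [False] * (c + 1)
    stall[c] = stall_in
    for i in reversed(range(c)):          # propagate from the last stage
        stall[i] = full[i] and (gen_stall[i] or stall[i + 1])
    return stall[:c]

# Stage 2 detects a cache miss; stages 0..2 are full, stage 3 is empty.
full      = [True, True, True, False]
gen_stall = [False, False, True, False]
print(compute_stalls(full, gen_stall, stall_in=False))
# stage 3 is not full and therefore not stalled; stages 0..2 are
# stalled by the generated stall of stage 2
```

The sketch also shows the bubble-absorption behavior: a non-full stage never propagates a stall to its predecessors.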

Figure 2.3: A stage with stall computation

Consider a pipeline with c stages. Assume that no stage can generate a stall, i.e., genStall_i = 0 for all i. This simplifies the computation of the stall signal to:

stall_i = full_i ∧ stall_{i+1} = (⋀_{j=i}^{c−1} full_j) ∧ stallIn.

Thus, a stage can only be stalled if it and all succeeding stages are full. Assume stage i is the non-full stage with the highest index. If the stall input of the pipeline is active, all stages with an index higher than i are stalled and all stages with an index lower than or equal to i are not stalled. This removes the invalid information in stage i (called pipeline-bubble removal).

Theorem 2.1. If the above implementation of the stall computation is used, the combinational delay D of a pipeline may be at most:

D ≤ δ · 2^δ.

Proof. Let c be the number of stages of the pipeline. It must hold that c ≥ D/δ. The stall signal for stage 0 is computed as the AND of the full bits of all stages and the input bit. Thus, its delay is at least D_AND · ⌈log(c + 1)⌉. The stall signal must be computed in one cycle. Thus:

δ ≥ D_AND · ⌈log(c + 1)⌉ ≥ log(c) ≥ log(D/δ) ⇔ 2^δ ≥ D/δ ⇔ δ · 2^δ ≥ D.

The combinational delay of the multiplicative floating point unit used in this thesis is 168. The theorem bounds the stage depth δ to be larger than 5, since 5 · 2^5 = 160 < 168. Hence, in order to reduce the stage depth to 5 or below, a different implementation of the stall computation must be found.
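The bound of Theorem 2.1 can be evaluated numerically; a small sketch (illustrative only) searches for the least stage depth δ with δ · 2^δ ≥ D:

```python
def min_stage_depth(D):
    """Smallest delta admitted by Theorem 2.1 for combinational delay D."""
    delta = 1
    while delta * 2 ** delta < D:
        delta += 1
    return delta

# Multiplicative FPU of this thesis: D = 168.
# 5 * 2**5 = 160 < 168, but 6 * 2**6 = 384 >= 168:
assert min_stage_depth(168) == 6
```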

Figure 2.4: Stage with buffer circuit

2.5.3 Optimization of the Stall Computation

The simplest way to raise the bound given by the theorem is to change the computation of the stall signal such that pipeline bubbles are no longer removed. For correctness it is usually sufficient to compute the stall signal in the following way:

stall_i = genStall_i ∨ stall_{i+1}.

Yet the minimum stage depth is still bounded by the delay of the OR of all signals genStall_⋆ and by the delay of the stall input of the pipeline. The delay of the stall input can be significant, e.g., if the information in the last stage of the pipeline can flow into multiple succeeding pipelines, as during decode, where an instruction can be issued to different reservation stations (see chapter 3). Then the stall input must be computed from the stall outputs of all succeeding pipelines.

The bound given by the signals genStall_⋆ and the stall input is highly implementation dependent and therefore not treated in detail. Instead, a more sophisticated solution to reduce the bound on the stage depth is described. This solution reduces the delay of the stall signals by pipelining the stall computation itself, i.e., registers are inserted into the stall computation circuit. This can be done by inserting a buffer circuit between the registers of a stage and the combinational circuit, as shown in figure 2.4. The buffer circuit is inserted at the dashed line of figure 2.3.

As long as the stage is not stalled, the buffer circuit is transparent, i.e., the outputs of the buffer circuit are equal to the corresponding inputs. The signal fullBuf is then 0 and hence connects the data output with the data input. However, if the stage is stalled, the content of the full and data registers is saved in the buffer circuit. This enables the stage to receive data from the preceding stage into the input registers without losing the current information; the stall need not be propagated to the preceding stage. If the buffer circuit contains a valid instruction, the output stall signal is active and the stage is stalled. Hence, the stage is stalled not earlier than one cycle after the input stall signal becomes active.
As soon as the stage is no longer stalled, the saved information from the buffer circuit is sent out, i.e., the buffer circuit is emptied, before the buffer circuit goes back into transparent mode.

The buffer circuit decreases the delay of the signal stallOut to the delay of an AND gate, independent of the delay of the stall input stallIn. It divides the pipeline into two pipelines with smaller stall circuits. Hence, the maximum combinational delay is no longer limited by the stall signals. Note that each buffer circuit increases the combinational delay of the pipeline by the delay of a mux.

The correctness of the buffer circuit is summarized in the following theorem. Note that the theorem only handles the buffer circuit. Thus, signals used in the theorem such as stallIn and stallOut describe the inputs and outputs of the buffer circuit, not those of the stage.

Theorem 2.2. Assume the inputs of the buffer circuit obey the following properties. The clear signal is active exactly in cycle 0:

clear^(t) = { 1 if t = 0; 0 if t > 0 }. (P0)

The signal stallIn is live:

∀t > 0 ∃t′ : t′ > t ∧ stallIn^(t′) = 0. (P1)

The data of all instructions which enter the buffer circuit are distinguishable:

∀t, t′ > 0 : fullIn^(t) = fullIn^(t′) = 1 ∧ stallOut^(t) = stallOut^(t′) = 0 ∧ dataIn^(t) = dataIn^(t′) ⇒ t = t′. (P2)

Then the buffer circuit adheres to the following statements. The buffer circuit is empty in cycle 1:

fullBuf^(1) = 0. (S0)

The signal stallOut is live:

∀t > 0 ∃t′ : t′ > t ∧ stallOut^(t′) = 0. (S1)

Every instruction which enters the circuit leaves the circuit exactly in the next possible cycle:

∀t > 0 : fullIn^(t) = 1 ∧ stallOut^(t) = 0 ⇒
{t′ | fullOut^(t′) = 1 ∧ stallIn^(t′) = 0 ∧ dataOut^(t′) = dataIn^(t)} = {min{t′ ≥ t | stallIn^(t′) = 0}}. (S2)

The ability to distinguish the instructions is needed for statement (S2). It can be achieved by adding a unique index to every instruction.¹ Since this index has no influence on the behavior of the buffer circuit, it need not be implemented in hardware for correctness. It is merely a means to state the theorem.

¹It is a common trick to use an infinite set of tags for a completeness criterion. It can be proven later on that a finite set suffices for correctness, as, e.g., in [BJK+03].

Proof. Statement (S0): The statement follows directly from the construction of the clear signal.

Statement (S1): The output stall signal may only be active if the input stall signal was active in the preceding cycle:

stallOut^(t) = fullBuf^(t) = (fullBuf^(t−1) ∨ full^(t−1)) ∧ stallIn^(t−1) ≤ stallIn^(t−1).

The statement follows from property (P1).

Statement (S2): The equivalence of the sets is proven in two steps.

“⊇”: Let t be such that fullIn^(t) = 1 and stallOut^(t) = 0. It follows:

fullBuf^(t) = stallOut^(t) = 0. (2.2)

Let t′ := min{t′ ≥ t | stallIn^(t′) = 0}. The following two cases can be distinguished:

t′ = t: It follows:

fullOut^(t′) = fullOut^(t) = fullIn^(t) ∨ fullBuf^(t) (2.2)= fullIn^(t) = 1,

dataOut^(t′) = dataOut^(t) = { dataBuf^(t) if fullBuf^(t) = 1; dataIn^(t) if fullBuf^(t) = 0 } (2.2)= dataIn^(t).

t′ ≠ t: Hence, stallIn^(t) = 1. It follows:

fullBuf^(t+1) = (fullBuf^(t) ∨ fullIn^(t)) ∧ stallIn^(t) ≥ fullIn^(t) = 1,

dataBuf^(t+1) = { dataBuf^(t) if fullBuf^(t) = 1; dataIn^(t) if fullBuf^(t) = 0 } (2.2)= dataIn^(t).

By definition of t′, it holds that stallIn^(u) = 1 for all u with t < u < t′. By induction it follows for these u:

fullBuf^(u+1) = (fullBuf^(u) ∨ fullIn^(u)) ∧ stallIn^(u) ≥ fullBuf^(u) (Ind.)= 1, (2.3)

dataBuf^(u+1) = { dataBuf^(u) if fullBuf^(u) = 1; dataIn^(u) if fullBuf^(u) = 0 } = dataBuf^(u) (Ind.)= dataIn^(t). (2.4)

Thus, the content of the registers fullBuf and dataBuf does not change as long as stallIn and fullBuf are active. For the cycle t′ it follows:

fullOut^(t′) = fullIn^(t′) ∨ fullBuf^(t′) (2.3)= 1,

dataOut^(t′) = { dataBuf^(t′) if fullBuf^(t′) = 1; dataIn^(t′) if fullBuf^(t′) = 0 } (2.3)= dataBuf^(t′) (2.4)= dataIn^(t).

Thus, t′ is in the set {t′ | fullOut^(t′) = 1 ∧ stallIn^(t′) = 0 ∧ dataOut^(t′) = dataIn^(t)}.

“⊆”: Let t′ be such that fullOut^(t′) = 1, stallIn^(t′) = 0 and dataOut^(t′) = dataIn^(t). Let t′′ := max{t′′ ≤ t′ | stallOut^(t′′) = 0}. It follows:

fullBuf^(t′′) = stallOut^(t′′) = 0. (2.5)

t′′ = t′: It follows:

fullIn^(t′′) (2.5)= fullIn^(t′′) ∨ fullBuf^(t′′) = fullOut^(t′′) = fullOut^(t′) = 1,

dataOut^(t′) = dataOut^(t′′) = { dataBuf^(t′′) if fullBuf^(t′′) = 1; dataIn^(t′′) if fullBuf^(t′′) = 0 } (2.5)= dataIn^(t′′).

t′′ ≠ t′: Hence, fullBuf^(t′) = stallOut^(t′) = 1. It follows:

dataOut^(t′) = { dataBuf^(t′) if fullBuf^(t′) = 1; dataIn^(t′) if fullBuf^(t′) = 0 } = dataBuf^(t′).

By definition of t′′, it holds for all u with t′′ < u ≤ t′:

fullBuf^(u) = stallOut^(u) = 1. (2.6)

Hence, for all u with t′′ < u < t′:

stallIn^(u) ≥ stallIn^(u) ∧ (fullBuf^(u) ∨ fullIn^(u)) = fullBuf^(u+1) (2.6)= 1,

dataOut^(t′) (Ind.)= dataBuf^(u+1) = { dataBuf^(u) if fullBuf^(u) = 1; dataIn^(u) if fullBuf^(u) = 0 } (2.6)= dataBuf^(u). (2.7)

Figure 2.5: Optimized computation of fullBuf′

For the cycle t′′ it follows:

fullBuf^(t′′) = stallOut^(t′′) = 0, (2.8)

stallIn^(t′′) ≥ stallIn^(t′′) ∧ (fullBuf^(t′′) ∨ fullIn^(t′′)) = fullBuf^(t′′+1) (2.6)= 1,

dataOut^(t′) (2.7)= dataBuf^(t′′+1) = { dataBuf^(t′′) if fullBuf^(t′′) = 1; dataIn^(t′′) if fullBuf^(t′′) = 0 } (2.8)= dataIn^(t′′).

From property (P2) it follows that t′′ = t. As stallIn^(u) = 1 for all u with t ≤ u < t′ and stallIn^(t′) = 0, the cycle t′ equals min{t′ ≥ t | stallIn^(t′) = 0}. This proves the claimed equality of the sets. □
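The behavior required by Theorem 2.2 can be illustrated with a cycle-accurate sketch of the buffer circuit (following the register update equations used in the proof; the scenario and all names are made up):

```python
# Update rules per cycle t, as in the proof of Theorem 2.2:
#   stallOut^t  = fullBuf^t
#   fullOut^t   = fullIn^t OR fullBuf^t
#   dataOut^t   = dataBuf^t if fullBuf^t else dataIn^t
#   fullBuf^t+1 = (fullBuf^t OR fullIn^t) AND stallIn^t   (0 after clear)
#   dataBuf^t+1 = dataBuf^t if fullBuf^t else dataIn^t

def simulate(inputs):
    """inputs: list of (fullIn, dataIn, stallIn) per cycle; clear in cycle 0.
    Returns the trace of (stallOut, fullOut, dataOut) per cycle."""
    full_buf, data_buf = False, None
    trace = []
    for t, (full_in, data_in, stall_in) in enumerate(inputs):
        stall_out = full_buf
        full_out = full_in or full_buf
        data_out = data_buf if full_buf else data_in
        trace.append((stall_out, full_out, data_out))
        clear = (t == 0)
        nxt_full = (full_buf or full_in) and stall_in and not clear
        data_buf = data_buf if full_buf else data_in
        full_buf = nxt_full
    return trace

# Instruction "A" enters in cycle 1 while stallIn is active; it is
# buffered, and it leaves in cycle 3, the first cycle with stallIn = 0,
# exactly as statement (S2) demands.
ins = [(False, None, False), (True, "A", True),
       (False, None, True), (False, None, False)]
trace = simulate(ins)
assert trace[3] == (True, True, "A")
```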

2.5.4 Maximum Delay of Stall Inputs

Let the stage i of a pipeline have a buffer circuit and assume none of the stages j ∈ {i + 1, ..., n} for an n ≥ i has a buffer circuit. For all stages j, the AND gate that forces the clocking of the full register in case the clear signal is active (see figure 2.3) can be removed from the critical path by rebalancing. The computation of the stall signal for stage i comprises the stall signals for all stages j. Hence, the delay of the stall signals for all stages j is at most as high as the delay of the input of the register fullBuf in the buffer circuit of stage i (see figure 2.4). Thus, if the delay of this signal fullBuf′_i is at most δ, the stall signals for all stages j can be computed in one cycle. Therefore, only the stages with a buffer circuit have to be checked for whether the delay of their stall inputs is too large.

Figure 2.5(a) details the computation of the signal fullBuf′ from figure 2.4. The stall input stallIn is usually computed by an AND-tree. If the signal genStall is not constantly zero, the OR gate hinders the integration of the last AND gate into this tree. If the order of the gates is switched using the distributive law as shown in figure 2.5(b), the AND gate can be integrated into the tree in order to reduce the delay. Figure 2.6 depicts an example of the integration into the AND-tree if the stall input is computed as the AND of the signals full_1 to full_k. The overall delay of the circuit is equivalent to the delay of the AND-tree of the full bits with 1 + 2⌈D_OR/D_AND⌉ additional inputs (i.e., 3 additional inputs if the delay of AND and OR gates is equal, as in the gate model used in this thesis).

Figure 2.6: Merging logic into the tree of the stall input computation

Hence, if the input stall signal of a buffer circuit is computed by an AND-tree with k inputs and the stage with the buffer circuit cannot generate a stall, the signal fullBuf′ can be computed in one cycle if:

δ ≥ D(AND-tree(k + 3)).

If the stage with the buffer circuit can generate a stall, the delay of the input of the register fullBuf increases by the delay of the rightmost OR gate in figure 2.5(b). Thus, the following inequality must hold:

δ ≥ D(AND-tree(k + 3)) + D_OR.

Let i be chosen as above and n be the length of the pipeline. Then the input of the register fullBuf of stage i depends on the input stall signal stallIn of the pipeline. If the stage i cannot generate a stall, it holds that:

fullBuf′_i = (fullBuf_i ∨ full_i) ∧ ¬clear ∧ (⋀_{j=i+1}^{n} full_j) ∧ stallIn.

Thus, if the stall input stallIn is computed by an AND-tree with l inputs, it must hold that:

D(AND-tree(n − i + 3 + l)) ≤ δ.

If stallIn cannot be merged into the AND-tree of the full signals it must hold:

D(AND-tree(n − i + 3)) ≤ δ − D_AND and

D(stallIn) ≤ δ − D_AND. (2.9)

The delay of the stall computation increases by at least D_OR if any stage j with i ≤ j ≤ n can generate a stall. Thus, in order to minimize the restrictions on the stall input stallIn, i should be chosen such that no such stage j can generate a stall. If the last stage of a pipeline can generate a stall, i cannot be chosen as above. It must then hold that:

D(stallIn) ≤ δ − (D_OR + D_AND). (2.10)

2.6 Pipelining of RAM Blocks

In order to reduce the access time of a RAM block it is mandatory to also pipeline the RAM block. A schematic view of a RAM block with n address bits and m data bits is shown in figure 2.7. A RAM block can be divided into three parts: the decoding of the address bus, the update of the data registers, and the selection of the addressed data [KP95].

Figure 2.7: Schematic view of a RAM block

Figure 2.8: Forwarding of a write port

Inserting registers in the decode and the select stage allows for smaller cycle times. However, a read access to such a pipelined RAM takes into account only the write accesses that have been started before the read access. At the time the result of a read access is on the output data bus, the accessed address may already have been overwritten by a succeeding write access. Thus, the RAM only returns the value of the accessed address at the time the read access entered the RAM. However, in many applications in this thesis the read access must return the value of the accessed address at the time it leaves the RAM. To obtain the latest value of an address, all write accesses that have been started after the read access have to be forwarded to the output of that read.

2.6.1 Forwarding

If forwarding is used, it can happen that an instruction does not enter or leave the RAM environment in the same cycle in which the instruction enters or leaves the RAM block. In the following, the term "a RAM access is started" always means that the access enters the RAM environment (which is usually as soon as all signals needed for the access are available). "An access finishes" always means that the access leaves the RAM environment.

The forwarding of a write port W to a read port R is done using the forwarding circuit from figure 2.8. If the RAM block is pipelined into c stages, the forwarding circuit is also divided into c stages. The stages of the forwarding circuit contain the data corresponding to the read access in the corresponding stage of the RAM block. In every cycle the forwarding circuit compares the newly started write access with the read accesses of all stages. If the write access overwrites the same address as a read access in the pipeline, the corresponding stage of the forwarding circuit is updated with the new write data. At the last stage of the forwarding circuit the read result is selected between the output of the RAM block and the data potentially saved in the forwarding circuit. In this way any write data to the read address is forwarded.

Figure 2.9: Forwarding circuit (with sub-circuit Test)

The details of the forwarding circuit are shown in figure 2.9. The address of the read access in stage i is saved in the register addr_i. The register forw_i carries the information whether the stage i of the forwarding circuit holds valid forwarded write data. In that case the data of the last forwarded write access is saved in the register data_i. The stage i computes the updated values forwUpd_i and dataUpd_i, which take the current write access on W.⋆ into account. The outputs of stage i are saved in the registers of stage i + 1.

A write access is forwarded to the read access in stage i if the write signal is active and the addresses of the accesses are equal. This is indicated by the signal Test.forward_i, computed by the sub-circuit Test shown in the left part of figure 2.9. For simplicity the circuit also bypasses the data of the write access to the output Test.data_i. Using the signals Test.forward_i and Test.data_i, the updated values forwUpd_i and dataUpd_i can be computed as:

forwUpd_i = forw_i ∨ Test.forward_i,

dataUpd_i = { Test.data_i if Test.forward_i = 1; data_i if Test.forward_i = 0 }.

At the last stage c − 1 of the forwarding circuit, the signal forwUpd_{c−1} is active if there has been a write access started after the read access which has overwritten the content of the accessed address. In this case the signal dataUpd_{c−1} contains the newest content of the RAM address. If forwUpd_{c−1} is not active, the output RAM.data of the RAM block contains the correct value. Hence, the current content of the accessed address can be computed as:

dataOut = { dataUpd_{c−1} if forwUpd_{c−1} = 1; RAM.data if forwUpd_{c−1} = 0 }.
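The interplay of the pipelined RAM and the forwarding circuit can be sketched behaviorally (an illustrative model, not the hardware implementation; the class and method names are made up):

```python
class ForwardingRAM:
    """RAM pipelined into c stages with forwarding of one write port.

    Each in-flight read carries (addr, ram, forw, data); every cycle the
    newly started write is compared against all in-flight reads
    (sub-circuit Test) and matching stages latch the write data.
    """

    def __init__(self, c, size):
        self.c = c
        self.mem = [0] * size
        self.pipe = []                  # in-flight reads, oldest last

    def cycle(self, read_addr, write=None):
        """write = (addr, data) or None; returns a finished read or None."""
        # Snapshot the value the RAM block itself would deliver, i.e.,
        # the content at the time the read enters the RAM.
        self.pipe.insert(0, {"addr": read_addr,
                             "ram": self.mem[read_addr],
                             "forw": False, "data": None})
        if write is not None:
            w_addr, w_data = write
            self.mem[w_addr] = w_data   # write enters the RAM block
            for stage in self.pipe:     # Test: forward to equal addresses
                if stage["addr"] == w_addr:
                    stage["forw"], stage["data"] = True, w_data
        if len(self.pipe) > self.c:     # read leaves after c cycles
            done = self.pipe.pop()      # select dataUpd vs. RAM.data
            return done["data"] if done["forw"] else done["ram"]
        return None

ram = ForwardingRAM(c=2, size=16)
ram.cycle(5)                   # read of address 5 starts
ram.cycle(0, write=(5, 42))    # write to address 5 starts after the read
assert ram.cycle(0) == 42      # read finishes with the forwarded value
```

Without the forwarding step the last call would return the stale snapshot 0, which is exactly the problem described above.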

Figure 2.10: Forwarding circuit with stalling

2.6.2 Forwarding with Stalling

If the read access to a RAM block can be stalled, it is not sufficient to update the read access with new write data only in those cycles in which the read advances to the next stage. Since new write accesses may be started even if the read access is stalled, forwarding must be possible within a stage while the read access does not progress. Figure 2.10 shows a forwarding circuit with stalling. It is based on the forwarding circuit without stalling from figure 2.9. If the stage i is stalled, the updated values forwUpd_i and dataUpd_i must not be written into the registers of stage i + 1 but into the registers of stage i itself. Thus, muxes above the registers data and forw of the stage i select the outputs of the stages i and i − 1 as inputs for the registers, depending on the stall signal stall_i:

data_i′ = { dataUpd_i if stall_i = 1; dataUpd_{i−1} if stall_i = 0 },

forw_i′ = { forwUpd_i if stall_i = 1; forwUpd_{i−1} if stall_i = 0 }.

Note that the stall signal is used in the combinational circuit to control the muxes above the registers. Thus, the stall signals of all stages i must adhere to:

D(stall_i) ≤ δ − D_MUX. (2.11)

2.6.3 Pipelining of the Forwarding Circuits

Let n be the width of the address bus. The critical path of the circuit presented in section 2.6.1 goes from the write port W.⋆ to the updated data dataOut. The delay of this path is

D(EQ(n + 1)) + 2 · D_MUX.

Figure 2.11: Pipelined forwarding circuit

Figure 2.12: Pipelined forwarding circuit with stalling

To reduce the cycle time below this bound, the path from the write port to the updated data must be pipelined. The intermediate results of the pipelined computation are stored in registers and flow together with the address, the forward bit and the data through the pipeline of the forwarding circuit. Figure 2.11 depicts as an example the stages i and i + 1 of a forwarding circuit without stalling, where the computation of the updated data is split after the circuit Test. The register tmp_{i+1} is used to store the outputs of the circuit Test and to pipeline them into the next forwarding stage.

The forwarding circuit is now pipelined in two dimensions. Within the same cycle the read access moves to the next stage of the forwarding circuit and the computation of the updated data moves to the next part of the computation (see the path highlighted in figure 2.11). While the computation of the updated data for one write access moves to the second part, a new computation can be started for the next write access.

In the forwarding circuit with stalling the forwarding of the write port must proceed to the next part of the computation even if the instruction does not move to the next stage of the forwarding circuit. Therefore, a multiplexer is added above every register saving the temporary results. This multiplexer selects, depending on the stall signal, whether the register is updated with the output of the current or the preceding stage (analogously to the multiplexers above the forward bit and the data registers in figure 2.10). Figure 2.12 depicts as an example a stage i of a forwarding circuit with stalling, where the circuit Test is divided into two circuits Test1 and Test2 and pipelining registers are added after both circuits. The path highlighted in the figure shows the forwarding of the write port if the stage is stalled twice.
In order to pipeline the computation of the updated data in the forwarding circuit without stalling (called Forward), the computation must be divided into parts with a combinational delay of at most δ. In the forwarding circuit with stalling (called ForwardStall), every inserted pipelining register increases the combinational delay from the write input to the updated data by the delay of the multiplexer above the register. Hence, for the circuit ForwardStall the computation of the updated data must be divided into parts with a combinational delay of at most δ − D_MUX.

Note that the new values written into the registers forw and data in the circuit ForwardStall depend on the previous values of the registers. If a pipelining register were added into this path, old data would be used to update the registers. For the data register this would mean that it holds the correct data only every other cycle. A general solution to pipeline these one-cycle dependencies cannot be given, but the resulting bound on the stage depth from this dependency is acceptable:

δ ≥ max{D_OR, D_MUX} + D_MUX. (2.12)

In our gate model this only requires that δ ≥ 4.

Compensating the pipelining cycles

Assume k pipelining registers are inserted into each stage of the forwarding circuit. Then the forwarding circuit needs k + 1 cycles to forward the write port into the data registers. Thus, the write accesses which have been started in the last k cycles before the read access finishes are not taken into account for the read result. Due to the pipelining it is not possible to take all writes into account that have been started before a read access, but it often suffices to take all those writes into account that have entered the RAM block at the time the read access finishes.

For a read access to take into account all the write accesses that have entered the RAM block at the time the read finishes, forwarding must be started at least k cycles before the write accesses enter the RAM block. This can be done by delaying the write port by k cycles and forwarding the un-delayed writes (see figure 2.13(a)). Due to the delaying of the write port, the up to k write accesses in the registers W1.⋆ to Wk−1.⋆ have been started before the read access but have not yet entered the RAM block. If the read access directly enters the RAM block when it is started, these k writes are not taken into account for the result of the RAM block. Therefore, the read port is also delayed by k cycles. The forwarding circuit must be extended by k stages to align with the read access.

Delaying the write port usually has no impact on the performance (as long as the un-delayed write is forwarded). The delaying of the read port of the RAM increases the overall delay of the read access and therefore the overall delay of the circuit where the RAM is used. Figure 2.13(b) shows a solution to take the k writes started directly before the read access into account for the result without delaying the read access. At the time the read is started, the addresses, write signals and data of these k write accesses are known and

Figure 2.13: Forwarding with pipelined forwarding circuit

Figure 2.14: Forwarding Tree

stored in the registers W1.⋆ to Wk−1.⋆. These write accesses are forwarded separately using a forwarding tree. A circuit for a forwarding tree with k inputs is depicted in figure 2.14. All write accesses are tested in parallel using the circuit Test. The circuit Forw computes from the signals forw and data of two successive write accesses the combined values of the signals forw and data. The write access at input In1.⋆ is assumed to have been started after the write access at input In2.⋆. Thus, the input In1.⋆ has higher priority. If the forward bit In1.forw is active, the bus In1.data is the new data output, otherwise the bus In2.data. The output forw is active if either of the inputs In⋆.forw is active. The circuits Forw can be arranged in a tree structure due to the following lemma.

Lemma 2.3. The function

◦ : B² × B² → B², (f1, d1) ◦ (f2, d2) := (f1 ∨ f2, f1 d1 ∨ ¬f1 d2)

is associative.

Proof.

((f1, d1) ◦ (f2, d2)) ◦ (f3, d3) = (f1 ∨ f2, f1 d1 ∨ ¬f1 d2) ◦ (f3, d3)

= (f1 ∨ f2 ∨ f3, (f1 ∨ f2)(f1 d1 ∨ ¬f1 d2) ∨ ¬(f1 ∨ f2) d3)

= (f1 ∨ f2 ∨ f3, f1 f1 d1 ∨ f1 ¬f1 d2 ∨ f2 f1 d1 ∨ f2 ¬f1 d2 ∨ ¬f1 ¬f2 d3)

= (f1 ∨ f2 ∨ f3, f1 d1 ∨ ¬f1 f2 d2 ∨ ¬f1 ¬f2 d3)

= (f1 ∨ f2 ∨ f3, f1 d1 ∨ ¬f1 (f2 d2 ∨ ¬f2 d3))

= (f1, d1) ◦ (f2 ∨ f3, f2 d2 ∨ ¬f2 d3)

= (f1, d1) ◦ ((f2, d2) ◦ (f3, d3))
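The associativity of ◦ can also be checked exhaustively over all bit assignments; a small sketch (the operator combines two (forw, data) pairs, giving priority to the first, i.e., younger, write):

```python
from itertools import product

def comb(a, b):
    """(f1, d1) o (f2, d2) = (f1 or f2, d1 if f1 else d2)."""
    f1, d1 = a
    f2, d2 = b
    return (f1 or f2, d1 if f1 else d2)

# Exhaustive check over all triples of (forw, data) bit pairs:
triples = product(product([0, 1], repeat=2), repeat=3)
assert all(comb(comb(x, y), z) == comb(x, comb(y, z))
           for x, y, z in triples)
```

Because the operator is associative, the k circuits Forw can be evaluated in any bracketing, in particular as a balanced tree of depth ⌈log k⌉.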

Using the pipelined forwarding circuit it is possible to compute the content of a RAM block at the time a read access returns, even for a small stage depth δ. However, the write accesses which have been started but have not yet entered the RAM block at the time the read access finishes cannot be forwarded. Additionally, delaying the write port may have further implications for the circuit writing to the RAM. Therefore, in the following sections it is discussed for every RAM which write ports are forwarded to which read ports, how this is done, and why the forwarding suffices to guarantee correctness.

2.6.4 Cost and Delay

If forwarding can be done without using a forwarding tree, the read access to the RAM is delayed by an additional mux that selects between the forwarding data and the RAM output entering the forwarding circuit. If a forwarding tree is used (it is assumed to be faster than the RAM access), the access is delayed by two muxes that select between the RAM output, the data output of the forwarding tree, and the data output of the forwarding circuit (see figure 2.13(b)). The longest combinational path of both forwarding circuits is the path from the inputs W.⋆ to the outputs dataOut. Let n be the number of address bits. Then the delays of the forwarding circuits are:

D(ForwardStall(n)) ≤ D(Test(n)) + 2 · DMUX ,

D(Forward(n)) ≤ D(Test(n)) + 2 · DMUX .

When pipelining the circuit Forward with n address bits, the computation of the outputs from the inputs W.⋆ takes

cF(n) ≤ ⌈D(Forward(n)) / δ⌉

cycles. In order to pipeline the circuit ForwardStall an additional mux is needed before every inserted register. Thus, the path from W.⋆ to dataOut must be divided into parts with combinational delay of δ − DMUX. In the last stage no additional mux is needed. Thus, the computation of the outputs of the circuit takes

cFS(n) ≤ ⌈(D(Forward(n)) − DMUX) / (δ − DMUX)⌉

cycles (if it is not stalled). Let n be the number of address bits, m the number of data bits, and c the number of stages of the forwarding circuits. Without temporary registers the cost of the forwarding circuits is:

C(Forward(n, m, c)) ≤ C(Test) + m · CMUX + (c − 1) · (C(Test) + (m + n + 1) · CREG + COR + m · CMUX),

C(ForwardStall(n, m, c)) ≤ C(Forward(n, m, c)) + (c − 1) · (m + 1) · CMUX.

Let cF and cFS be the minimum numbers of cycles needed for forwarding and for forwarding with stalling, respectively. The total cost of the forwarding circuits is (approximated with equation 2.1):

C(Forward(n, m, c, cF)) ≤ C(Forward(n, m, c)) + (c − 1) · cF · ⌈((m + n + n + 1) + (m + 1))/2⌉ · CREG

≤ C(Forward(n, m, c)) + (c − 1) · cF · (m + n + 1) · CREG,

C(ForwardStall(n, m, c, cFS)) ≤ C(ForwardStall(n, m, c)) + (c − 1) · cFS · ⌈((m + n + n + 1) + (m + 1))/2⌉ · (CREG + CMUX)

≤ C(ForwardStall(n, m, c)) + (c − 1) · cFS · (m + n + 1) · (CREG + CMUX).
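The stage-count formulas above can be evaluated numerically; the following sketch does this for assumed sample delays (the numbers are placeholders, not values from the thesis):

```python
from math import ceil

# Sketch of the pipelining formulas: d_forward models D(Forward(n)),
# d_mux models D_MUX, and delta is the stage depth δ.
def cycles_forward(d_forward, delta):
    # c_F(n) <= ceil(D(Forward(n)) / delta)
    return ceil(d_forward / delta)

def cycles_forward_stall(d_forward, d_mux, delta):
    # c_FS(n) <= ceil((D(Forward(n)) - D_MUX) / (delta - D_MUX)):
    # every stage but the last needs an extra mux before its register.
    return ceil((d_forward - d_mux) / (delta - d_mux))

# Example with assumed delays: D(Forward) = 14, D_MUX = 2, delta = 5.
print(cycles_forward(14, 5))
print(cycles_forward_stall(14, 2, 5))
```

As expected, the stalling variant needs at least as many stages, because each stage loses DMUX of its budget to the stall mux.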

The cost of the forwarding tree with k inputs (without registers) is:

C(ForwardTree(n, m, k)) ≤ k · C(Test) + (k − 1) · C(Forw) + m · CMUX.

The forwarding tree needs to be divided into as many stages as the RAM access. Let c be the number of stages of the RAM access. Then the total cost of the pipelined forwarding tree is approximately:

C(ForwardTree(n, m, k, c)) ≤ C(ForwardTree(n, m, k)) + (c − 1) · ⌈(k · (m + n + n + 1) + m + 1)/2⌉ · CREG

≤ C(ForwardTree(n, m, k)) + (c − 1) · k · (n + m + 1) · CREG.

Chapter 3

Tomasulo Algorithm

The DLX variant presented by Kröning, which the DLXπ+ is based on, uses the Tomasulo algorithm [Tom67] to execute instructions out-of-order, which allows for low CPI ratios [MLD+99]. It is assumed that the reader is familiar with this algorithm. Therefore this chapter gives only an informal description of the algorithm to define the terms used throughout this thesis. A formal description including correctness proofs can be found, e.g., in [KMP99].

The description of the Tomasulo algorithm is divided into three parts. In section 3.1 a general overview is given which defines the most important terms. Section 3.2 describes the basic data structures used by the algorithm. Finally, in section 3.3 the algorithm is presented in more detail, showing the execution of an example instruction.

3.1 Overview

For the Tomasulo algorithm every instruction which is being processed needs to be identified by a unique number. This number is called tag. The instructions which are processed at a given time are called active instructions. Figure 3.1 shows an overview of the Tomasulo hardware.

The instruction fetch unit does not differ from in-order processors. It loads the instruction stream from the main memory and delivers it in-order to the decode environment. In the decode environment the operands of the instructions are determined. If an operand is not yet computed, the decode environment determines the tag of the instruction which will compute the operand. Afterward the instruction is sent to a reservation station.

The instructions wait in the reservation stations until all operands are valid. The reservation stations check if the CDB carries the result of an instruction that is needed as operand for a waiting instruction. If this is the case, the data is copied and the operand is validated. This forwarding from the CDB is called snooping. As soon as all operands are valid, the instruction is sent to a functional unit, independent of the instruction order.

The functional unit computes the result of the instruction and writes it to the common data bus. The common data bus forwards the result to the reservation stations and writes it into the reorder buffer. The reorder buffer reorders the instructions into program order before it writes the results into the register file.

Figure 3.1: Overview of the Tomasulo hardware — instruction fetch unit (fetch), decode environment (decode), reservation stations (dispatch), functional units (execute), common data bus, reorder buffer environment (complete), register file environment (retire)

mnemonic  functional unit
BCU       branch checking unit
Mem       memory unit
IAlu      integer ALU
IMul      integer multiplicative unit
FAdd      floating point additive unit
FMul      floating point multiplicative unit
FMisc     floating point miscellaneous unit

Table 3.1: Functional unit types of the DLXπ+

3.2 Basic Data Structures

3.2.1 Functional Units

The functional units (FUs) perform the actual execution of the instructions. A processor can have different FU types, each executing only a subset of the instruction set. Multiple FUs of the same type are supported. The FU types of the DLXπ+ are listed in table 3.1. Tables A.1 to A.7 in the appendix list all DLXπ+ instructions sorted by the type of their functional unit.

3.2.2 Register Files and Producer Tables

The DLXπ+ has three different register files (RF): the general purpose register file (GPR), the floating point register file (FPR), and the special purpose register file (SPR). For every register file entry, the Tomasulo algorithm additionally stores a valid bit and a tag field. If no active instruction will write a register file entry, the content of the register file can be used as operand of a new instruction. In this case the valid bit of the register file entry is set to one. If one or more active instructions will write an entry, the valid bit is zero and the tag field stores the tag of the youngest active instruction which will write this entry.

The valid bits and the tag bits of the register file entries are saved in the producer tables (PT). Usually the producer tables have more access ports than the register files. Therefore, they are saved in different RAM blocks.

3.2.3 Reservation Stations

The reservation stations (RS) consist of multiple entries. Each entry can hold one instruction. The instructions wait in the entries until all operands are valid. To keep track of the state of the operands, each operand of an entry has a valid bit and a tag field, either indicating that the data in the entry is already valid or identifying the instruction which will eventually compute the value of the operand data.

The DLXπ+ has exactly one reservation station (of multiple entries) for every functional unit. Therefore, a reservation station is often identified with its functional unit. It is also possible to use only one reservation station per instruction type or even only one global reservation station [HP96]. For simplicity the latter cases are not treated in this thesis.

3.2.4 Common Data Bus

All functional units write the instruction results to the common data bus (CDB). To identify the current result on the CDB, the CDB has a valid bit and a tag field. This relates the result to the instruction which computed it. The common data bus writes the result to the reorder buffer. Also, the result is forwarded to the reservation stations. This allows the reservation stations to snoop on the CDB, i.e., to check whether the result is needed as operand for a waiting instruction.

3.2.5 Reorder Buffer

The reordering of the instructions before the update of the register files is done by the reorder buffer (ROB). For every instruction, an entry in the reorder buffer is allocated in program order. For each entry the reorder buffer has a valid bit. This bit is active if the result of the instruction has already been computed. If the valid bit of the oldest instruction in the ROB is active, the instruction is removed from the ROB in order to write its result to the register file. The address of the reorder buffer entry of an instruction is used as the tag of the instruction.
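The allocation, completion, and retirement discipline of the ROB can be illustrated by a toy software model; this is not the thesis hardware, and all names are illustrative. Note how the entry address doubles as the instruction's tag:

```python
# Toy model of the ROB discipline: entries are allocated in program order,
# results arrive out of order, and an instruction retires only when it is
# the oldest active instruction and its valid bit is set.
class ReorderBuffer:
    def __init__(self, size):
        self.entries = [None] * size   # None, or (valid, result)
        self.head = 0                  # oldest active instruction
        self.tail = 0                  # next free entry
        self.count = 0

    def allocate(self):
        assert self.count < len(self.entries), "ROB full: decode must stall"
        tag = self.tail                # the entry address serves as the tag
        self.entries[tag] = (False, None)
        self.tail = (self.tail + 1) % len(self.entries)
        self.count += 1
        return tag

    def complete(self, tag, result):
        # CDB write: mark the entry valid and store the result
        self.entries[tag] = (True, result)

    def retire(self):
        # only the oldest instruction may retire (restores program order)
        if self.count == 0 or not self.entries[self.head][0]:
            return None
        result = self.entries[self.head][1]
        self.entries[self.head] = None
        self.head = (self.head + 1) % len(self.entries)
        self.count -= 1
        return result
```

Even if a younger instruction completes first, its result stays buffered until all older instructions have retired.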

3.3 Instruction Execution

The execution of an instruction is done in six phases: fetch, decode, dispatch, execute, complete, and retire. The instruction fetch does not differ from in-order processors. Let I be an instruction which is being executed by the processor.

3.3.1 Decode

In the decode phase the opcode of the instruction I is decoded and the valid bit, the tag, and the data field of the operands are determined. The producer table entry of the destination register of I is updated and a new entry for I is reserved in the reorder buffer. Afterward the instruction is sent to the reservation stations. This last step is called issue.

The decoding of the instruction I is straightforward and does not differ much from an in-order processor. The decode environment computes the addresses of the operand and the destination registers as well as the control signals used by the functional units. For each operand the decode phase must determine the valid bit, the tag, and the data field. The correct value for the data field of an operand can be found at one of four different places:

• The register file: If no preceding active instruction writes the register file entry, the register file contains the valid data for the operand. In this case the valid bit of the corresponding producer table entry is set. If the valid bit is not set, the value tagP of the tag field of the producer table entry identifies the latest instruction IP which will compute the value of the operand.

• The reorder buffer: If the instruction IP has already completed but is not retired yet, the data for the operand can be found in the reorder buffer. This can be tested by checking the valid bit of the reorder buffer entry of the instruction IP . Note that the address of the entry of IP in the reorder buffer is tagP .

• The common data bus: If the instruction IP is about to complete, the operand can be found on the CDB. This is the case if the valid bit of the CDB is active and the tag of the CDB equals tagP .

• The reservation stations or the functional units: In this case the value of the operand is not computed yet.

In the first three cases the valid bit of the operand can be set to one and the data can be taken from the corresponding place. In the last case, the valid bit of the operand has to be set to zero and the tag field of the operand is set to tagP . Then the instruction will wait in the reservation station until the result of the instruction IP (identified by tagP ) becomes available on the CDB.
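The four-way case distinction above can be summarized in a small illustrative model; the data structures and names are assumptions for this sketch, not the thesis signals:

```python
# Illustrative operand lookup during decode. prod_table maps a register
# address to (valid, tag); rob maps a tag to (valid, result); cdb is None
# or a (tag, data) pair. Returns (valid, tag, data) for the operand.
def lookup_operand(addr, reg_file, prod_table, rob, cdb):
    valid, tag_p = prod_table[addr]
    if valid:                          # no active producer: RF data is current
        return (1, None, reg_file[addr])
    rob_valid, rob_data = rob[tag_p]
    if rob_valid:                      # producer completed but not yet retired
        return (1, None, rob_data)
    if cdb is not None and cdb[0] == tag_p:
        return (1, None, cdb[1])       # producer is completing right now
    return (0, tag_p, None)            # wait in the RS, snooping for tag_p
```

The last case is the only one in which the reservation station has to snoop; in the first three the operand leaves decode already valid.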

In parallel to the determination of the operands, the decode phase reserves a new entry in the ROB for the instruction I. For this, the valid bit of the entry is reset. All information about the instruction which is known at decode time (e.g., the destination register address) is written into the corresponding ROB fields. Let tag be the address of the reorder buffer entry of I.

For the succeeding instructions to use the correct data, the producer table entry of the destination register of the instruction I is updated. The valid bit of the entry is set to zero and the tag field is set to tag. This tells succeeding instructions that the new value of the register will be computed by instruction I.

At the end of the decode phase the instruction is sent to the reservation station of the functional unit that corresponds to the type of the instruction. If multiple functional units (and thus reservation stations) of one type exist, one of these reservation stations has to be chosen. In this thesis the first reservation station that is not full is used.

3.3.2 Dispatch

The instruction I waits in the reservation station until all of its operands are valid. Assume an operand of I is not valid and depends on instruction IP. The decode phase guarantees that in this case the instruction IP is in a reservation station or a functional unit (see above). Hence, the result of IP will eventually be sent via the common data bus. The reservation station entry of instruction I snoops on the CDB for the tag tagP of the instruction IP. If the tag of the CDB matches tagP and the valid bit of the CDB is active, the data of the CDB is copied to the data field of the operand and the operand is marked as valid in the reservation station. As soon as all operands are valid, the instruction I is sent to the functional unit (dispatch). If multiple instructions in a reservation station are valid at the same time, the oldest instruction is dispatched.
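The snooping step can be sketched as follows (an illustrative model, not the actual circuit; operand fields follow the valid/tag/data convention from above):

```python
# Toy model of CDB snooping by one reservation station entry: every
# non-valid operand compares its tag with the tag on the CDB.
def snoop(operands, cdb_valid, cdb_tag, cdb_data):
    for op in operands:                # op: dict with "valid", "tag", "data"
        if not op["valid"] and cdb_valid and op["tag"] == cdb_tag:
            op["data"] = cdb_data      # copy the result from the CDB
            op["valid"] = True
    return all(op["valid"] for op in operands)  # True: ready to dispatch
```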

3.3.3 Execute

During the execute phase, the actual execution of the instruction is performed. This is done in the functional units. The functional units do not necessarily return the instructions in the order they enter. For example, a floating point multiplication can overtake a floating point division [Jac02].

3.3.4 Completion

During the completion phase the results of the functional units are written to the ROB and forwarded to the reservation stations via the CDB. The number of functional units is usually larger than the number of results which fit on the CDB. Therefore, an arbiter decides which functional unit with an available result may write to the common data bus. During the completion phase the valid bit of the ROB entry of the instruction is set to one.

3.3.5 Retire

The original Tomasulo algorithm [Tom67] writes the register files out-of-order. For precise interrupt handling it is necessary to be able to restore the content of the register file as if the instructions had been executed in order. This can be very complex if instructions write the register file out-of-order. Therefore, the instructions are reordered into program order before updating the register files, using the ROB [SP85].

To restore the program order, only the oldest instruction in the ROB is checked for whether it has already completed. If so, the instruction is taken out of the reorder buffer and the result is written into the destination register entry of the register files. The valid bit of the register file can then be set to flag valid data, unless a younger active instruction writes to the same register. In order to check whether a younger active instruction will write the same register, the producer table entry of the destination register is read. If the tag stored in the producer table matches the tag of the instruction I, no younger instruction may write the same destination register; otherwise it would have updated the tag of the producer table entry. In this case the valid bit of the producer table entry can be set.
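The producer table check at retire can be illustrated as follows (a toy model with assumed data structures, not the thesis circuit):

```python
# Sketch of the retire-time update: the register file always receives the
# result, but the producer table entry is only marked valid if no younger
# instruction has claimed the register in the meantime.
def retire_update(prod_table, reg_file, dest, tag, result):
    reg_file[dest] = result
    pt_valid, pt_tag = prod_table[dest]
    if pt_tag == tag:                  # no younger writer updated the entry
        prod_table[dest] = (True, pt_tag)
```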

In the retire phase it is also checked whether the instruction causes an interrupt or whether a branch misprediction occurred. In these cases all succeeding instructions are invalid and the processor has to be flushed. Since the register file is in a sequentially correct state, the producer table entries are all set to valid to indicate that no instructions will write the registers.

Chapter 4

Processor Core

In this chapter the design of the DLXπ+ core (without instruction fetch unit) is presented. The stage depth of the presented design is variable and is assumed to be at least 5 gate delays. Sections 4.1 to 4.5 describe the hardware for the five phases of the instruction processing in the core (decode, dispatch, execute, completion, and retire). Since the main RAM structures are accessed in multiple phases, they are described afterwards in sections 4.6 to 4.8.

The design of the core is based on the Tomasulo DLX of Kröning [Krö99]. Splitting the ROB into multiple smaller RAMs to reduce the number of ports was introduced by Hillebrand [Hil00]. Yet almost all non-trivial circuits had to be redesigned in order to maximize the performance and to allow a stage depth of 5 gate delays.

4.1 Decode

4.1.1 Overview

The decode phase is divided into the two sub-phases D1 and D2. In the sub-phase D1 the instruction word is decoded and the control signals for the instruction are computed. The valid bit, the tag, and the data of the operands are read from the current content of the producer tables and register files, respectively. In the sub-phase D2 the instruction is issued to a reservation station corresponding to the type of the instruction. In parallel the reorder buffer is checked for whether any of the instructions identified by the tags of the operands has already completed. If this is the case, the valid bit and the data field of the operand are updated. The decode phase uses the instruction register as input. The instruction register environment is described in section 6.4 as part of the instruction fetch chapter.

Sub-phase D1

Figure 4.1 gives an overview of the sub-phase D1. The sub-circuit Decode computes the control signals for the current instruction. In parallel to the decoding, the operands of the instruction are determined. For this the register files in the sub-circuit RF and the corresponding producer tables in the sub-circuit PT are accessed. Each register file type (GPR, FPR, and SPR) is accessed speculatively under the assumption that the operands are registers of this register file. The addresses of the accesses are computed inside the register file and producer table environments using the instruction register IR. The design of the register files and the producer tables is described in detail in sections 4.7 and 4.8.

Figure 4.1: Decode sub-phase D1¹

The circuit OpGen selects for each operand i the register file which is used, based on the outputs of the circuit Decode. If none of the register files is selected, an immediate constant computed by Decode is used as operand data. To update the destination register entry of the producer table, the register file and the address of the destination register must be known. These values are computed by the circuit DestCmp. Due to the delay of the circuit DestCmp, the update of the producer table entry of the destination register is started cDC ≥ 1 cycles after the read of the operands' producer table entries. The read access to the producer tables for the operands must return the content of the producer table after the decoding of all preceding instructions. Hence, to maintain correctness, the update of the destination register must be forwarded to the next cDC instructions. This is done in the producer table environment (see section 4.8).

Apart from the instruction, the instruction fetch unit delivers data to the decode phase which is needed for interrupts and branch checking, e.g., the PC of the instruction and the predicted target of branches. This data is not modified by the decode sub-phase D1.

Sub-phase D2

The design of the decode sub-phase D2 differs from the design proposed in [Krö99]. In Kröning's work the instructions first access the ROB and are then issued to the reservation stations. For the correctness of the Tomasulo algorithm it is necessary that no update of the ROB by the CDB is missed by the instruction until the instruction is written to a reservation station and starts to snoop on the CDB. This can be guaranteed easily if the whole decode phase including issuing of the instruction fits into one cycle, as in the design presented by Kröning. However, for a small stage depth it is difficult to forward the CDB while issuing.

Figure 4.2 shows an overview of the decode sub-phase D2. The design presented in this thesis issues the instructions in parallel to the ROB access. This can be done because the operands of an instruction have consistent values for the valid bit, the

¹The producer table is accessed twice during decode. In order to emphasize that the two accesses are largely independent, the producer table environment is depicted twice in the figure, even though it is implemented only once.

Figure 4.2: Decode sub-phase D2²

tag, and the data when the instruction leaves the decode sub-phase D1, i.e., if valid is active, the data field contains the correct operand data; otherwise the tag contains the tag of the instruction that computes the operand. If the result of the read access to the ROB by this instruction is available, it is used to update the instruction's operand data in the reservation station. For this the reservation stations snoop on the output of the ROB access in the same way they snoop on the CDB.

The ROB access and the issuing need not be aligned, i.e., if issuing is stalled, the ROB access can still progress. It must only be guaranteed that the instruction is issued before the ROB access finishes. Otherwise the instruction would miss the result of the ROB access, as this result only updates the reservation stations but not the instructions in the issue circuit. This may lead to a deadlock, since the reservation station may then wait indefinitely for valid operand data.

The ROB read access by an instruction must take into account all writes to the ROB by the CDB that have been started until the instruction is issued to a reservation station. This is guaranteed by the ROB environment (see section 4.6). Once an instruction is issued to a reservation station, all updates to the ROB by the CDB are used to update the instruction's operands through snooping of the reservation station. In parallel to the read access to the ROB, a new ROB entry must be allocated for the instruction. This is done by resetting the valid bit of the ROB entry and writing all information about the instruction that is already known during decode into this entry (e.g., the destination address, the PC of the instruction, or interrupts occurring during fetch or decode). The design of the ROB environment is described in detail in section 4.6.

4.1.2 Operands

An instruction may have up to four operands Opi for i ∈ {1, ..., 4}. The first two operands have up to 64 bits for double precision operations. They are divided into a low part Opi.lo and a high part Opi.hi, each 32 bits wide. The operands Op3 and Op4 contain the floating point mask and the rounding mode which are needed by floating point instructions. The floating point mask is 5 bits wide, the rounding mode is 2 bits wide. These operands are not divided into high and low parts. Note that the operands Op3 and Op4 always read the special register file.

In the Tomasulo algorithm each part of an operand consists of a valid bit, a tag, and a data field. The valid bit is set if the data field contains valid data. If the valid bit

²As in figure 4.1, the ROB environment is depicted twice to emphasize the two independent accesses to the ROB.

is not set, the tag field contains the tag of the instruction which computes this operand. In case an operand is not needed for an instruction, the valid bit is set to one in order to prevent the reservation station from waiting and snooping for this operand. Each operand can be either an immediate constant or an entry of one of the register files GPR, FPR, and SPR. The signals for the operand Opi from the register file are denoted by Opi.R.{valid, tag, data}. If the operand is an immediate constant the bus

Opi.CO.{valid, tag, data} := (1, 0^lROB, Opi.imm)

is used, where Opi.imm is the immediate constant for Opi computed by the circuit Decode. The memory unit Mem and the branch checking unit BCU may use a third operand as address offset. This operand is always an immediate constant, hence it is always valid and does not need to be updated in the reservation station. This operand may thus be treated like a control signal rather than as operand for the functional units. This saves the logic for this operand in the reservation station.

4.1.3 Instruction Decoding Circuit

The sub-circuit Decode of the decode sub-phase D1 computes the following control signals for every instruction. The precise definition and the computation of the signals are described in appendix C.2.1:

• FU.{Mem, IAlu, IMul, FAdd, FMul, FMisc, BCU}: These signals indicate to which functional unit type an instruction is sent. If an instruction has been predicted to be a branch instruction (indicated by the signal IFQ.pb, see section 6.3), the instruction is sent to the BCU, independent of the real opcode (see the BCU section 6.5). To simplify the ROB environment, all instructions use an FU, even if they do not produce a result, e.g., due to an instruction page fault. Such instructions use the integer ALU as fake FU. This guarantees that exactly one of the signals FU.⋆ is active in every cycle.

• Opi.{gpr, fpr, spr}: These signals indicate from which register file the operand i is read (i ∈ {1, 2}). For each operand at most one of these signals may be active. If none of the signals is active, the operand is assumed to be an immediate constant.

• Opi.dbl: For operands i ∈ {1, 2}, this signal is active if the operand is a double precision floating point register.

• Opi.imm: If the operand i ∈ {1, 2} is an immediate constant, the value of this constant is encoded in OPi.imm. On an instruction memory interrupt the immediate constant is set to the PC of the instruction. This simplifies the reorder buffer (see section 4.5.3).

• ill: This signal is active if an illegal instruction occurred. This can happen, e.g., due to an invalid opcode or due to a double precision floating point access to an odd register address.

• readIEEEf, writeIEEEf: These signals are active if the instruction is a move instruction that reads or writes the special register IEEEf, respectively.

• FPstore: This signal is active for floating point stores.

Cost and Delay

Cost and delay of this circuit are (see appendix C.2.1):

C(Decode) ≤ 381 · CAND + 64 · COR + 32 · CMUX,

D(Decode) ≤ 6 · DAND + 5 · DOR.

4.1.4 Operand Generation

The circuit OpGen determines the operands of the instruction. If one of the control signals OPi.{gpr, fpr, spr} is active, the output for the operand i of the corresponding register file is selected. If none of the control signals is active, the immediate constant OPi.CO.⋆ is used. The operands OP3 and OP4 are only used by floating point operations. They always read the same special registers: the floating point mask bits (SPR[0]) and the rounding mode (SPR[6]). Thus:

OPi.⋆ := OPi.SPR.⋆ for i ∈ {3, 4}

The high part of the operands OP1 and OP2 is only used if the operand is a floating point double precision register (indicated by OPi.dbl = 1). Thus, the high part always reads the floating point register file:

OPi.hi.valid := { OPi.FPR.hi.valid if OPi.dbl = 1
                  1                else }          for i ∈ {1, 2},

OPi.hi.{tag, data} := OPi.FPR.hi.{tag, data}.

The low part of the first two operands can be an immediate constant or a register from any register file. The signals OPi.{gpr, fpr, spr} from the circuit Decode determine which register file is used. At most one of these signals may be active. The low part of the first two operands can then be computed as:

OPi.lo.⋆ := { OPi.GPR.⋆    if OPi.gpr = 1
              OPi.FPR.lo.⋆ if OPi.fpr = 1
              OPi.SPR.⋆    if OPi.spr = 1
              OPi.CO.⋆     else }                  for i ∈ {1, 2}.

The reservation stations require four additional control signals OPi.depDbl and OPi.odd for i ∈ {1, 2} to decide whether the data needed by an operand is on the low or the high part of an instruction's result. The signal OPi.depDbl is active if the operand i depends on a 64 bit result. The signal OPi.odd indicates that the address of the operand OPi is odd. Since 64 bit operands must read an even address, it follows that the operand is 32 bits wide and therefore only uses the low part of the operand. If

OPi.FPR.depDbl if OPi.fpr = 1 OPi.depDbl := OPi.SPR.depDbl if OPi.spr = 1 . 0 else  The FPR and the SPR also return the lowest bit of the addresses used by the operands 1 and 2, OPi.R.addr[0]. Then the signal OPi.odd can be computed as:

OPi.odd := { OPi.FPR.addr[0] if OPi.fpr = 1
             OPi.SPR.addr[0] if OPi.spr = 1
             0               else }.

Cost and Delay

The circuit OpGen is implemented as a unary select circuit. It is controlled by the signals OP⋆.{gpr, fpr, spr} and (¬OP⋆.gpr ∧ ¬OP⋆.fpr ∧ ¬OP⋆.spr). The last signal can be computed by the circuit Decode; its delay is part of the delay of that circuit. Let lROB be the width of the tags. Cost and delay of the circuit OpGen are as follows:

C(OpGen) ≤ 2 · COR + 2 · 2 · CAND + 2 · (1 + 32 + lROB) · C(Sel(4)) + 4 · C(Sel(2)),

D(OpGen) ≤ D(Sel(4)).

4.1.5 Destination Computation

The circuit DestCmp computes the signals D.⋆ needed for updating the producer table entry of the destination register of the instruction. The valid bit of the producer table entry is always set to zero and the tag field is set to the tag of the instruction. The tag of the instruction is the current tail pointer of the ROB:

D.valid := 0, D.tag := ROB.tail.

For each register file R the circuit DestCmp computes a write signal D.R.write and an address D.R.addr. It also computes a double signal for the register files FPR and SPR (integer multiplications / divisions write their 64 bit result into two SPR registers). As for the circuit Decode, the detailed computation for this circuit is described in appendix C.2.2. At most one of the write signals D.R.write is active. This signal determines the register file that the result is written to. Thus, the actual address of the destination register D.addr can be computed by selecting the addresses for the register files with the write signals. An instruction writes a double precision result if the floating point or the special purpose double bit is active. Thus:

D.addr := { D.GPR.addr if D.GPR.write = 1
            D.FPR.addr if D.FPR.write = 1
            D.SPR.addr if D.SPR.write = 1
            ⋆           else },

D.dbl := (D.FPR.dbl ∧ D.FPR.write) ∨ (D.SPR.dbl ∧ D.SPR.write).

The signals D.addr, D.dbl, and D.R.write are saved in the ROB to define the destination register for the retire phase.

Cost and Delay

The cost and the delay of the circuit DestCmp can be estimated as follows (see appendix C.2.2):

C(DestCmp) ≤ 58 · CAND + 16 · COR + 27 · CMUX,

D(DestCmp) ≤ max{2 · DOR, 2 · DAND, DMUX} + 2 · DOR.

4.1.6 Instruction Issue

The circuit Issue issues the instructions to the reservation stations. For simplicity a number is assigned to every functional unit type (see table 4.1). The control signals FU.⋆ computed by the circuit Decode are renamed to FUi, i ∈ {0,...,6}, where i is the number of the type of the functional unit (e.g., FU.IAlu is renamed to FU1). An instruction which uses a functional unit of type i is called an "instruction of type i".

Due to restrictions in the dispatch order (see section 4.2.2) the processor must have exactly one memory and one branch checking unit. For all other types multiple functional units are possible. The number of functional units of type i for i ∈ {0,...,6} is denoted by ni. The functional units of type i are denoted by FUi,j for j ∈ {0,...,ni − 1}. The total number of functional units n is defined by n := Σ_{i=0}^{6} ni.

The processor has one reservation station per functional unit. The reservation station of a functional unit of type i is called a "reservation station of type i". The reservation station of functional unit FUi,j is denoted by RSi,j. All reservation stations of one type have the same number of entries. An instruction of type i is issued to the first reservation station RSi,j which can accept a new instruction (i.e., RSi,j.stallOut = 0). The instruction is sent to that reservation station by asserting its full input. If none of the reservation stations of type i can accept new instructions, issuing has to be stalled.

no | mnemonic | functional unit
---|----------|-----------------------------------
0  | Mem      | memory unit
1  | IAlu     | integer ALU
2  | IMul     | integer multiplicative unit
3  | FAdd     | floating point additive unit
4  | FMul     | floating point multiplicative unit
5  | FMisc    | floating point miscellaneous unit
6  | BC       | branch checking unit

Table 4.1: Functional unit numbers

[Figure 4.3: Computation of stall signal and full bits for issue]

Figure 4.3 gives an overview of the circuit Issue. It shows the computation of the output stall signal stallOut of the circuit Issue and the full signals for the functional units FUi,j.full. The full signals for the functional units are connected to the full inputs of the corresponding reservation stations. The data signals of the instruction to be issued (i.e., tag, operands, and control) are not shown in figure 4.3. They are simply distributed to all reservation stations and connected to the corresponding input busses.

The selection of the reservation station is done as follows: first the signal fullIn (indicating that the input registers of the issue circuit contain a valid instruction) is AND-ed with the FU-type indicators FUi to obtain FUi.full for i ∈ {0,...,6}. The signal FUi.full indicates that a valid instruction needs to be issued to a reservation station of type i. Next, the number j ∈ {0,...,ni − 1} of the first reservation station of type i that can accept an instruction is determined. This can be done by a find-first-one circuit using the negated stall signals of the reservation stations of type i. The output of the find-first-one circuit is AND-ed with the signal FUi.full to obtain the full signals for the functional units FUi,j.full for j ∈ {0,...,ni − 1}.

The zero output of the find-first-one circuit for the type i indicates that no reservation station of this type can accept a new instruction. It is used as stall signal for the functional unit type, FUi.stall. The overall stall signal for the circuit Issue is computed by selecting the stall signal of the type of the instruction to be issued (defined by the FU-type indicators FU0...6). This is done by the select circuit in figure 4.3. If the number of functional units of a type ni is one, the find-first-one circuit and the AND gate can be omitted. In this case the full and stall signals for this unit type and reservation station are the same.

[Figure 4.4: Pipelined distribution to the FU types]
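The selection described above can be sketched behaviorally as follows (a software model of figure 4.3, not the hardware itself; the list encoding of the stall signals is an assumption of this sketch):

```python
# Behavioral model of the issue selection: find the first non-stalling
# reservation station of the instruction's type and raise its full input;
# the selected type's stall signal becomes the overall stallOut.
def issue(full_in: bool, fu_type: int, rs_stall: list):
    """rs_stall[i][j] = stallOut of RS_{i,j}. Returns (stallOut, full_bits)."""
    full = [[False] * len(row) for row in rs_stall]
    stall = [False] * len(rs_stall)
    for i, row in enumerate(rs_stall):
        # find-first-one over the negated stall signals of type i
        j = next((k for k, s in enumerate(row) if not s), None)
        if j is None:
            stall[i] = True              # zero output of the FFO circuit
        elif full_in and fu_type == i:   # FU_i.full = fullIn AND FU_i
            full[i][j] = True            # FU_{i,j}.full
    # select circuit: pick the stall signal of the instruction's type
    return stall[fu_type], full
```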

Pipelining

The delay of the stall output stallOut of the issue circuit is:

D(stallOut) ≤ max{ max_{0≤i≤6} (D(RSi,⋆.stallOut) + D(FFO(ni))), D(FU⋆)} + D(Sel(7)).   (4.1)

Assume the processor has two integer ALUs and two FP additive units, i.e., n1 = n3 = 2. The delay of the signal stallOut would then be D(FFO(2)) + D(Sel(7)) = 5. In order to compute the stall signal for the input register of the issue circuit in one cycle, the delay of the signal stallOut may be at most δ − DAND = δ − 1 (see section 2.5.4). Thus, to support the above values the stage depth must be at least 6.

In order to relax this bound the distribution to the FU types can be split into multiple cycles. For this the FU types are combined into groups (and sub-groups if the circuit Issue is split into more than two cycles). The instructions are first distributed to their group (and sub-group) and then to the functional units. Since the delay of the stall signal is critical, a buffer circuit is placed after the registers for the FU group to decompose the stall computation.

Figure 4.4 shows an overview of a pipelined issue circuit. The FU types are divided into g groups (G0 to Gg−1). The group Gl has gl members (for l ∈ {0,...,g − 1}). For each group Gl a new signal FUGl has to be computed, indicating whether the instruction has to be issued to a FU type of this group:

FUGl := ⋁_{i∈Gl} FUi.

These signals can already be computed in the decode circuit. Then, they come directly out of the output registers of the decode sub-phase D1. Therefore, it is assumed that the signals FUG⋆ have the same delay as the signals FU⋆. Let the type i of an instruction be in the group Gk. In the pipelined issue circuit the instructions are first issued to the group Gk. From the group Gk the instruction is sent to the circuit Issue(gk), which needs to support only gk FU types.

The buffer circuit inserted after the registers for a group decouples the stall signal for the issuing to a group from the stall signals for the issuing from the groups to the reservation stations. Note that the stall signals FUG⋆.stall are computed by AND-ing the full bits of the register and the buffer circuit for that group and therefore have delay DAND (see figure 2.4 on page 13). For the sake of readability, this AND-gate is not shown in figure 4.4.

If issuing from the group Gk to the reservation stations still cannot be done in one cycle, the groups can be further divided into sub-groups. The circuit Issue(gk) is then built analogously to the pipelined issue circuit in figure 4.4. If the circuit fits into one cycle it is built analogously to the non-pipelined circuit in figure 4.3. The computation of the full bits for the groups FUG⋆.full and the output stall signal stallOut based on the stall signals of the groups FUG⋆.stall happens analogously to the computation for the instruction types in the non-pipelined case. Note that the input signals for the circuits Issue(⋆) come out of buffer circuits. In equation (4.1), D(FU⋆) = DMUX then holds (see figure 2.4).

The circuit described above pipelines the issuing of an instruction to its type, but not the issuing from an instruction type to the reservation stations. In order to pipeline this part, registers would need to be inserted into the computation of the stall signals of the FU types FUi.stall. This would mean that the computation of the signals FUi,j.full would also take several cycles. However, these signals are used inside the reservation station to compute the output stall signal of the reservation station for the next cycle (see section 4.2.2), and therefore its own value for the next cycle. Inserting registers into this computation can lead to inconsistencies due to the one cycle dependency, e.g., two instructions could be sent to a reservation station that can only accept one instruction.
In order to avoid this problem, issuing from the FU type to the functional units of the type has to be done in one cycle. The bound implied by this is acceptable.

Cost and Delay

If the size of the group from which an instruction of type i is issued to the reservation stations is one, the full signal for the FU type FUi.full can be derived directly from the full bit of the group. It does not have to be AND-ed with the FU type indicator FUi. The delay of the full bit for the group is DOR from the buffer circuit. The find-first-one circuit and the AND-gate at its output are only needed if ni > 1. This reduces the delay of the full signals for the FUs FUi,⋆.full to:

D(FU⋆,⋆.full) ≤ max{DOR, max_{0≤i≤6} (D(RSi,⋆.stallOut) + D(FFO(ni)))}
              + 0      if ni = 1
                DAND   if ni > 1.

Let eRSi be the number of entries of the reservation stations of type i. Then the delay of the stall output of these reservation stations is D(PP-OR(eRSi)), as will be derived later in section 4.2.2. The input full signal to the reservation stations may have at most the delay δ − DMUX (see section 4.2.1). Since the issuing from the FU type to the reservation station is not pipelined, ni and eRSi are bounded by the following equation:

δ ≥ DMUX + D(FUi,⋆.full)
  ≥ DMUX + max{DOR, max_{0≤i≤6} (D(PP-OR(eRSi)) + D(FFO(ni)))}
        + 0      if ni = 1
          DAND   if ni > 1.     (4.2)

The input registers of the decode sub-phase D2 are not assumed to have buffer circuits. This reduces the delay of the input signals to the issue circuit FU⋆ to zero. The delay of the output stall signal (see equation (4.1)) may be at most δ − DAND. Let the variable pI be one if the circuit Issue has to be pipelined. Hence:

pI := 0   if δ ≥ max_{0≤i≤6} (D(PP-OR(eRSi)) + D(FFO(ni))) + D(Sel(7)) + DAND
      1   else.

If pI is one, three numbers H, I, and J have to be computed. The number H defines the maximum number of FU types to which an instruction can be issued in one cycle. The maximum number of groups an instruction can be issued to in one cycle is denoted by I. The maximum number of sub-groups an instruction can be issued to from a group is denoted by J. The delay of the data output of a buffer circuit is DMUX and the delay of the stall output of a buffer circuit is DAND. The delay of the output stall signal of a stage may be at most δ − DAND, as it is computed by a select circuit and therefore the AND-gate of the buffer circuit cannot be merged into the computation (see section 2.5.4). Thus, H, I, and J can be computed using the following equations:

H = max{h | δ ≥ max{ max_{0≤i≤6} (D(PP-OR(eRSi)) + D(FFO(ni))), DMUX} + D(Sel(h)) + DAND},

I = max{i | δ ≥ DAND + D(Sel(i)) + DAND},

J = max{j | δ ≥ max{DAND, DMUX} + D(Sel(j)) + DAND}.
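The three maximizations can be evaluated by a simple scan, since the right-hand sides grow monotonically in h, i, and j. The sketch below assumes a concrete delay model (DAND = DOR = 1, DMUX = 2, logarithmic D(Sel), D(FFO), D(PP-OR)); the thesis derives the exact circuit delays in chapter 2, so these functions are placeholders:

```python
# Sketch of evaluating the bounds H, I, and J for a given stage depth delta.
from math import ceil, log2

D_AND, D_MUX = 1, 2

def d_sel(k):    return ceil(log2(k)) + 1 if k > 1 else 0   # assumed D(Sel(k))
def d_ffo(n):    return ceil(log2(n)) if n > 1 else 0       # assumed D(FFO(n))
def d_pp_or(e):  return ceil(log2(e)) if e > 1 else 0       # assumed D(PP-OR(e))

def compute_H(delta, e_rs, n):
    """Max number of FU types handled in one cycle (at least 1)."""
    inner = max(max(d_pp_or(e_rs[i]) + d_ffo(n[i]) for i in range(7)), D_MUX)
    h = 1
    while delta >= inner + d_sel(h + 1) + D_AND:
        h += 1
    return h

def compute_I(delta):
    i = 1
    while delta >= D_AND + d_sel(i + 1) + D_AND:
        i += 1
    return i

def compute_J(delta):
    j = 1
    while delta >= max(D_AND, D_MUX) + d_sel(j + 1) + D_AND:
        j += 1
    return j
```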

The value of J can be increased using the following optimization. Assume issuing takes cI cycles. Instead of clearing the issue circuit if the signal clear is active, it suffices to clear the reservation stations for cI cycles when clear is active. After the first clear cycle, the reservation stations do not produce a stall and therefore the instructions in the issue circuit advance in at most cI − 1 cycles to the reservation stations, where they are cleared. Note that no valid instruction can enter the reservation stations in the cI cycles after a clear. Therefore, it can be assumed that the registers of the issue circuit do not have to be cleared.

Figure 4.5 (a) depicts the computation of the input of the register FUG.fullBuf of a group FUG that issues an instruction to the sub-groups FUSG0 to FUSGJ−1 without taking the clear signal into account. The figure includes the OR-gate that is used inside the buffer circuit of the group to compute the full output

[Figure 4.5: Optimized issuing to sub-groups]

(see figure 2.4 on page 13). It also includes for each sub-group FUSGi the AND-gate that is used inside the buffer circuit of the sub-group to compute the stall output FUSGi.stall. The signals FUSG⋆ have delay DMUX since they are part of the data output of the buffer circuit for the group FUG (see figure 2.4). All other input signals in the figure come directly out of registers. Hence, the delay of the signal FUG.fullBuf′ can be reduced by moving the last AND-gate of the computation below the select circuit (see figure 4.5 (b)). Then the value of J can be computed as:

J = max{j | δ ≥ max{2 · DAND, DAND + DOR, DMUX} + D(Sel(j))}.

In order to reduce the number of FU types per group, I and J must be at least 2. Thus, it must hold:

δ ≥ DAND + D(Sel(2)) + DAND, (4.3)

δ ≥ max{2 · DAND, DAND + DMUX, DMUX} + D(Sel(2)). (4.4)

Both equations hold for δ ≥ 5. The number of cycles cI needed for the circuit Issue can be computed as follows:

cI = 1   if pI = 0
     2   if pI = 1 ∧ H · I ≥ 7
     3   if pI = 1 ∧ H · I < 7 ∧ H · I · J ≥ 7
     4   if pI = 1 ∧ H · I · J < 7.

The delay of the output stall signal is:

D(stallOut) ≤ max_{0≤i≤6} (D(PP-OR(eRSi)) + D(FFO(ni))) + D(Sel(7))   if cI = 1
              DAND + D(Sel(⌈7/G⌉))                                     if cI = 2
              DAND + D(Sel(⌈7/(G · H)⌉))                               if cI = 3
              DAND + D(Sel(⌈7/(G · H · I)⌉))                           if cI = 4.

The inputs to the circuit Issue are the full bit, the tag, the operands, and the control bits. The width of the control bits for the functional units except the memory and the branch checking units is approximated by 8. The memory unit needs an additional 16 bit immediate constant. The branch checking unit needs a 24 bit immediate constant, the PC of the instruction, the predicted branch target, and the way of the branch target buffer as additional control signals (see chapter 6). Thus, the numbers of inputs II and outputs OI of the issue circuit are:

II = 1 + 7 · lROB + 64 + 64 + 5 + 2 + 8 + 24 + 32 + 32 + kBTB,

OI = n · (1 + 3 · lROB + 64 + 8) + (n3 + n4 + n5) · (4 · lROB + 64 + 5)
     + 16 + 24 + 32 + 32 + kBTB.

Then the approximated cost of the issue circuit is:

C(Issue) ≤ Σ_{i=0}^{6} (C(FFO(ni)) + ni · CAND) + 7 · CAND + C(Sel(7))
         + (cI − 1) · ⌈(II + OI)/2⌉ · CREG.
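The case distinction for the number of issue cycles cI given earlier in this section can be transcribed directly; pI, H, I, and J are taken as precomputed inputs (a sketch, not hardware):

```python
# Direct transcription of the four-way case distinction for c_I.
def issue_cycles(p_i: int, H: int, I: int, J: int) -> int:
    if p_i == 0:
        return 1            # issue fits into one cycle
    if H * I >= 7:
        return 2            # groups cover all 7 FU types in two cycles
    if H * I * J >= 7:
        return 3            # sub-groups needed
    return 4
```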

4.1.7 Stalling

The first stage of the sub-phase D1 has to be stalled if the ROB is full or if decoding has to be stopped until a detected branch misprediction has retired. The ROB is full if the signal ROB.full computed by the ROB control (see section 4.6.5) is active. Decoding has to be stopped due to a misprediction if the signal IR.haltdec computed by the instruction register environment (see section 6.4) is active. Thus:

D1.genStall0 := ROB.full ∨ IR.haltdec,   (4.5)

D1.stall0 := D1.full0 ∧ (D1.stallIn0 ∨ D1.genStall0).   (4.6)

An instruction which reads the special register IEEEf (indicated by the control signal readIEEEf computed in the circuit Decode) must wait until all floating point operations have retired. This is due to the fact that all floating point instructions write the register IEEEf implicitly. For simplicity the instructions reading that register wait until all preceding instructions have retired, which is indicated by the signal allRet computed by the ROB control. Instructions reading the register IEEEf are considered rare, hence the performance impact can be neglected.

The instructions which read the register IEEEf must wait in the pipeline stage of the decode sub-phase D1 in which the register files in the circuit RF return the result of the read access (denoted by cRF). The SPR guarantees that the correct content of register IEEEf is returned as soon as the signal allRet is active without restarting the read access (see section 4.7.4). Thus:

D1.genStallcRF := readIEEEf ∧ ¬allRet,   (4.7)

D1.stallcRF := D1.fullcRF ∧ (D1.stallIncRF ∨ D1.genStallcRF).   (4.8)

As discussed in the overview section, the ROB access in sub-phase D2 must not finish before the corresponding instruction has been issued. Otherwise the instruction would miss the result of the ROB access. If issuing is done in one cycle, this can be guaranteed without extra hardware.

[Figure 4.6: Issue test]

It suffices that the output stall signal of the issue circuit is used as stall signal of the input register of the sub-phase D2:

D2.stall0 := D2.full0 ∧ Issue.stallOut. (4.9)

An instruction then waits in the input register of the decode sub-phase D1 until it is issued. The ROB access does not have to be stalled.

If issuing is pipelined, additional hardware is needed to detect whether an instruction has already been issued. For each stage of the ROB access, a circuit IssueTest (see figure 4.6) checks if the instruction which is in this stage is already issued to a reservation station. An instruction is issued to a functional unit of type i if the stall signal for this type RSi.stall is not active and the full bit for this type FUi.full is active (see section 4.1.6). To check whether the issued instruction is the instruction in the ROB access stage, the tags are compared. The tag of the instruction in the ROB access stage is called D2.ROB.tag in the figure. The signal newissue is active if the instruction in the ROB access stage is issued to any functional unit.

The values of newissue are accumulated in the register issued, i.e., issued is active if the instruction has been issued in any previous cycle. The register issued has to be updated even if the stage of the ROB access is stalled. The update of this register is done analogously to the update of the register forw in the forwarding circuit with stalling (see section 2.6.3), i.e., if the ROB access stage is stalled, the register issued is updated, otherwise the updated result is saved in the next stage. The computation of newissue can be pipelined similar to the computation of forw in the forward circuit with stalling.

Let cD2 be the number of stages of the ROB read access in the decode sub-phase D2. The output register of the ROB access has to be stalled if it is full and the register issued is not active:

D2.stallcD2 := D2.fullcD2 ∧ ¬D2.issuedcD2.

This guarantees that the ROB read access of an instruction is held in the last stage of the access until the instruction is in the reservation station and snoops on the output of the ROB. The other stages are stalled as usual, i.e., stallOut := stallIn ∧ full. Thus, the stall output of the ROB access ROB.stallOut can be computed as:

ROB.stallOut := ( ⋀_{i=1}^{cD2} D2.fulli) ∧ ¬D2.issuedcD2.   (4.10)
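The accumulation of newissue into the issued register can be sketched behaviorally (a software model of the IssueTest circuit of figure 4.6; the list encoding of the per-type signals is an assumption of this sketch):

```python
# Behavioral model of IssueTest: newissue fires when some FU type accepts
# an instruction whose tag matches the tag held in the ROB-access stage;
# newissue is accumulated into the issued flag across cycles.
def issue_test(issued: bool, rob_tag: int,
               fu_full: list, rs_stall: list, fu_tag: list) -> bool:
    """Return the updated value of the issued register."""
    newissue = any(full and not stall and tag == rob_tag
                   for full, stall, tag in zip(fu_full, rs_stall, fu_tag))
    return issued or newissue
```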

The input register of the decode sub-phase D2 is stalled if either the ROB access or the issue circuit produces a stall:

D2.stall0 := D2.full0 ∧ (ROB.stallOut ∨ Issue.stallOut).   (4.11)

Let cI be the number of cycles needed for issuing. Then, combining formulas 4.9 and 4.11, one gets for the stall output of the decode sub-phase D2 (which is the stall signal of the first stage of D2):

D2.stallOut := D2.full0 ∧ ROB.stallOut                        if cI = 1
               D2.full0 ∧ (ROB.stallOut ∨ Issue.stallOut)     if cI > 1.   (4.12)

Let lROB be the width of the tags. Then the delay of the circuit IssueTest is:

D(IssueTest) ≤ D(EQ(lROB + 1)) + D(OR-Tree(7)) + DOR.

Let cIT be the number of cycles needed for the circuit IssueTest. The numbers of inputs and outputs are 8 · lROB + 15 and 1, respectively. It holds:

cIT = ⌈D(IssueTest)/(δ − DMUX )⌉,

C(IssueTest) ≤ 7 · C(EQ(lROB + 1)) + 8 · COR + CMUX + CREG

+ cIT · (4 · lROB + 8) · (CREG + CMUX ).
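The computation of D(IssueTest) and cIT above can be evaluated numerically under an assumed delay model (logarithmic D(EQ) and D(OR-Tree), DOR = 1, DMUX = 2); the thesis derives the exact component delays in chapter 2, so the helper functions here are placeholders:

```python
# Sketch of evaluating D(IssueTest) and c_IT for a stage depth delta.
from math import ceil, log2

D_OR, D_MUX = 1, 2

def d_eq(k: int) -> int:       # assumed delay of a k-bit equality tester
    return 1 + ceil(log2(k))

def d_or_tree(k: int) -> int:  # assumed delay of a balanced k-input OR tree
    return ceil(log2(k))

def d_issue_test(l_rob: int) -> int:
    # D(IssueTest) <= D(EQ(l_ROB + 1)) + D(OR-Tree(7)) + D_OR
    return d_eq(l_rob + 1) + d_or_tree(7) + D_OR

def c_it(l_rob: int, delta: int) -> int:
    # c_IT = ceil(D(IssueTest) / (delta - D_MUX))
    return ceil(d_issue_test(l_rob) / (delta - D_MUX))
```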

4.1.8 Cost and Delay

The delay of the decode phase depends on the number of buffer circuits that have to be inserted in order to compute the stall signals within the cycle time. Up to two buffer circuits are inserted in order to reduce the delay of the stall signals. The first buffer circuit is inserted after the stage cRF in which the register files return the results. The second buffer circuit is inserted somewhere before the stage cRF. To further decrease the delay of the stall signal, more buffer circuits could be inserted into the stages of D1 and the ROB access in D2; however, this is not necessary for a stage depth of 5 and greater. If no buffer circuit has to be inserted into the decode sub-phase D1, its delay is:

D(D1) ≤ max{D(PT), D(RF), D(Decode)} + D(OpGen).

The values of the number of cycles cD1 of the sub-phase D1 and of cRF can be computed based on the number of buffer circuits b:

cD1(b) = ⌈(D(D1) + b · DMUX)/δ⌉,

cRF(b) = ⌈D(SPR-RF)/δ⌉             if b ≤ 1
         ⌈(D(SPR-RF) + DMUX)/δ⌉    if b > 1.

If no buffer circuits are inserted in the sub-phase D1, the stall signal for the first stage of D1 (and therefore the output stall signal of the whole decode phase D1.stallOut) is computed as (see equation 4.6):

D1.stallOut = D1.full0 ∧ (D1.stallIn0 ∨ D1.genStall0)
       (4.5) = D1.full0 ∧ (D1.stallIn0 ∨ ROB.full ∨ IR.haltdec).

The stages 1 to cRF − 1 do not generate stall signals. Thus:

D1.stallOut = D1.full0 ∧ (( ⋀_{i=1}^{cRF−1} D1.fulli ∧ D1.stallIncRF−1)
              ∨ ROB.full ∨ IR.haltdec).

This can be transformed using the distributive law to:

D1.stallOut = ( ⋀_{i=0}^{cRF−1} D1.fulli ∧ D1.stallIncRF−1)
              ∨ (D1.full0 ∧ (ROB.full ∨ IR.haltdec)).

Using formula 4.8 for the computation of D1.stallcRF = D1.stallIncRF−1, it follows:

D1.stallOut = ( ⋀_{i=0}^{cRF} D1.fulli ∧ (D1.stallIncRF ∨ D1.genStallcRF))
              ∨ (D1.full0 ∧ (ROB.full ∨ IR.haltdec))   (4.13)

       (4.7) = ( ⋀_{i=0}^{cRF} D1.fulli ∧ (D1.stallIncRF ∨ (readIEEEf ∧ ¬allRet)))
              ∨ (D1.full0 ∧ (ROB.full ∨ IR.haltdec))

The remaining stages of the sub-phase D1 do not generate stalls. Let cD1 be the number of stages of D1. Then, if no buffer circuits are inserted in D1, the stall output of the decode phase can be computed as:

D1.stallOut = ( ⋀_{i=0}^{cRF} D1.fulli ∧ (( ⋀_{j=cRF+1}^{cD1−1} D1.fullj ∧ D2.stallOut)
              ∨ (readIEEEf ∧ ¬allRet)))
              ∨ (D1.full0 ∧ (ROB.full ∨ IR.haltdec)).

Figure 4.7 shows the computation of the signal D1.stallOut.

Let cI be the number of stages needed for the circuit Issue and cD2 the number of cycles needed for the ROB access in the decode sub-phase D2. The value of cD2 is computed in the ROB section 4.6. Then the delay of the output stall signal for the decode sub-phase D2 can be computed as:

D(ROB.stallOut) ≤ D(AND-Tree(cD2))   (by 4.10),

D(D2.stallOut) ≤ D(Issue.stallOut)                                if cI = 1
                 max{D(ROB.stallOut), D(Issue.stallOut)} + DOR    if cI > 1   (by 4.12).

[Figure 4.7: Computation of the decode stall output without buffer circuit]

The control signals allRet, readIEEEf, and IR.haltdec come directly out of registers and therefore have delay zero. The signal ROB.full has delay DMUX as discussed later in section 4.6.7. The stall output of the decode phase may have at most the delay δ − DMUX (see section 6.4). Thus, the number of buffer circuits b can be zero if the following equation holds:

δ − DMUX ≥ max{max{D(D2.stallOut), D(AND-Tree(cD1(0) − cRF(0) − 1))} + DAND + DOR,
               D(AND-Tree(cRF(0) + 1)), D(ROB.full) + DOR}
           + DAND + DOR.

The first buffer circuit is inserted in the stage after the register files return the results of the read accesses, i.e., in stage cRF(1) + 1. This allows to split the stall computation at the signal D1.stallcRF+1 = D1.stallIncRF. In equation 4.13 the signal D1.stallIncRF can then be computed as (see figure 2.4 on page 13):

D1.stallIncRF = D1.fullcRF+1 ∨ D1.fullBufcRF+1.

The critical signal of the stall computation of the sub-phase D1 for the stages below the stage cRF is then the input D1.fullBufcRF+1′ of the full bit of the buffer circuit. The AND-gate in the buffer circuit that is needed for the computation of this signal (see figure 2.4) can be incorporated into the AND-Tree for the signals D1.fullcRF+1...cD1−1. This increases the number of inputs by 2 (see section 2.5.4). Figure 4.8 shows the computation of the critical signals of the stall computation if one buffer circuit is inserted.

Thus, one buffer circuit is sufficient (b = 1) if the following equations hold:

δ ≥ max{D(D2.stallOut),D(AND-Tree(cD1(1) − cRF (1) + 1)) + DAND},

δ ≥ max{DAND + DOR, D(AND-Tree(cRF(1) + 1)), DMUX + DOR}
    + DAND + DOR + DMUX.

[Figure 4.8: Computation of the decode stall signal with one buffer circuit inserted]

[Figure 4.9: Computation of the signal D1.fullBuf0′ with two buffer circuits inserted]

The AND-gate and the two OR-gates needed for the computation of the OR of the signals D1.stallIncRF and D1.genStallcRF (see figure 4.8) can be merged into the AND-tree below. This effectively increases the number of inputs of the AND-tree by 4 and allows the balancing of the AND-tree in order to minimize the overall delay. Thus, the second equation can be replaced by:

δ ≥ max{D(AND-Tree(cRF(1) + 5)), DMUX + DOR + DAND} + DOR + DMUX.

If δ does not fulfill the last equation, a second buffer circuit is inserted into stage 0 of D1. The delay of the stall output of the decode phase is then DAND since it comes directly out of a buffer circuit. Thus, for δ ≥ 5, the requirement D(D1.stallOut) ≤ δ − DMUX (see section 6.4) holds. Figure 4.9 depicts the computation of the input D1.fullBuf0′ of the full register of the buffer circuit in stage 0 of sub-phase D1 if two buffer circuits are inserted into D1. Using the trick presented in figure 2.5 on page 17 the last AND-gate of the circuit can be removed from the critical path. This adds 3 more inputs to the AND-tree. Thus, if two buffer circuits are inserted, the following equations must hold:

δ ≥ max{D(D2.stallOut),D(AND-Tree(cD1(2) − cRF (2) + 1)) + DAND},

δ ≥ max{D(AND-Tree(cRF (2) + 7)),DMUX + DOR + DAND} + DOR.

These equations hold for δ = 5.

The inputs of the decode sub-phase D1 consist of the full bit, the instruction register, and the signals used by the branch prediction. The width of the input is 100 + kBTB (the full bit, the 32 bit instruction word, the 32 bit instruction address, the 32 bit prediction result, two interrupt signals, a control signal, and kBTB bits for the branch target buffer way, see section 6.9). The outputs of the decode sub-phase D1 consist of the inputs to the issue circuit and the data written into the ROB in sub-phase D2. The width of the output is II + 15. The total cost of the decode sub-phase D1 is (excluding the RAM environments):

C(D1) ≤ C(Decode) + C(DestCmp) + C(OpGen)
      + cD1 · ⌈(115 + kBTB + II)/2⌉ · CREG.

Let cI be the number of stages of the circuit Issue and let cD2 be the number of stages of the read access to the ROB during D2. The cost of the sub-phase D2, not counting the ROB environment, is:

C(D2) ≤ C(Issue) + (II + 15) · CREG
      + 0                     if cI = 1
        cD2 · C(IssueTest)    if cI > 1.

4.2 Dispatch

As described in chapter 3, each functional unit has one reservation station with one or more entries. The instructions wait in the reservation stations until all operands are valid. The operands become valid by snooping on the CDB and on the output of the ROB access started during decode. The snooping on the ROB output is necessary since the instructions are issued before the ROB access finishes. As soon as all operands are valid, the instruction is sent to the functional unit and the entry holding the instruction is cleared.

Figure 4.10 shows a reservation station with eRS entries. Each entry can hold one instruction. The circuits for the entries RS-Entryk (k ∈ {0,...,eRS − 1}) form a queue. New instructions are always filled into entry 0. Whenever possible, instructions move to the next entry to make room for new instructions in entry 0. Since instructions cannot overtake each other in the reservation station, the oldest instruction is always in the entry with the highest number.

The busses CDB.⋆ and D2.OP⋆.ROB.⋆ are connected to every entry. If the tag on one of the busses equals the tag of a not yet valid operand of an entry, the operand in the entry is updated. The data on the bus is saved and the operand is marked valid. An arbitration of the two busses is not necessary, as they cannot update the same operand in the same cycle, since otherwise the instruction that is forwarded would be on the CDB and in the ROB at the same time.

[Figure 4.10: Reservation station]

The reservation station is controlled by the circuit RS-Control. This circuit selects the oldest instruction which is ready (i.e., all operands are valid) and sends it to the functional unit via the bus RS.⋆. This step is called dispatch of an instruction. The reservation station control also controls the movement of the entries in the queue and computes the output stall signal of the reservation station.

4.2.1 Entries

The circuit RS-Entry for one reservation station entry is depicted in figure 4.11. The input bus In.⋆ equals the output bus of the preceding entry or, in case of entry 0, the input received from the issue circuit. The output bus Out.⋆ contains the updated content of the entry. If the control signal fill is active, the entry is filled with the content of the bus In.⋆. Otherwise the current content is updated and held in that stage. The register full indicates that the entry contains a valid instruction. It is set by filling a valid instruction into the entry via the bus In.⋆. The full signal is reset if the instruction is dispatched to the functional unit (indicated by disp = 1) or if the reservation station is cleared (clear = 1). The register con contains information about the instruction which is not altered by the reservation station, for example the tag and the opcode of the instruction. Hence, it is only updated if fill = 1.

The circuit RS-Entry has a sub-circuit RS-Op for every operand. Each operand has a valid bit, a tag, and a data field. The number of operands o depends on the type of the functional unit. The maximum number of operands is six for the floating point units. Table 4.2 maps the operands to the RS-Op circuits of the reservation station entries for the functional units. The operands are updated by the CDB bus CDB.⋆ and the output of the ROB read access which was started in decode sub-phase D2. Note that the ROB is accessed for every operand in parallel and therefore computes one separate bus for every operand (see section 4.6). Similar to the forwarding circuit with stalling, the CDB and the ROB update the output bus Out.⋆ of the entry. Thus, if the instruction flows from entry i to i + 1, the entry i + 1 is updated. If the instruction remains in entry i, this entry is updated.

[Figure 4.11: Reservation station entry]

        | RS-Op1 | RS-Op2 | RS-Op3 | RS-Op4 | RS-Op5 | RS-Op6
width   | 32     | 32     | 32     | 32     | 5      | 2
--------|--------|--------|--------|--------|--------|-------
Mem     | OP1.lo | OP2.lo |        |        |        |
IAlu    | OP1.lo | OP2.lo |        |        |        |
IMulDiv | OP1.lo | OP2.lo |        |        |        |
FAdd    | OP1.hi | OP1.lo | OP2.hi | OP2.lo | OP3    | OP4
FMulDiv | OP1.hi | OP1.lo | OP2.hi | OP2.lo | OP3    | OP4
FMisc   | OP1.hi | OP1.lo | OP2.hi | OP2.lo | OP3    | OP4
BCU     | OP1.lo | OP2.lo |        |        |        |

Table 4.2: Mapping of the operands

The bus Entry.⋆ is sent to the control circuit RS-Control. The content of this bus is the same as the content of the bus Out.⋆ with two differences: first, the full bit of the bus Entry holds the content of the register full instead of the updated value. This old value of the full bit is used to decide whether new information can be filled into the entry (see section 4.2.2). Second, the bus Entry.⋆ has an extra bit req indicating that the entry contains a valid and ready (i.e., all operands are valid) instruction and hence requests that the instruction in this entry gets dispatched:

ready := ⋀_{j=1}^{o} Opj.valid,    req := full ∧ ready.

Let wC be the width of the register con. Then the cost of an entry with o operands of widths w1 to wo can be estimated as:

C(RS-Entry(wC, o, w⋆)) ≤ Σ_{j=1}^{o} C(RS-Op(wj)) + C(AND-Tree(o)) + 2 · CAND + CMUX + (wC + 1) · CREG.
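The cost bound above can be evaluated numerically. The sketch below uses made-up unit costs (CAND, COR, CMUX, CREG) and made-up models for C(Test) and C(AND-Tree); the thesis defines its own gate-cost constants, so the resulting numbers are purely illustrative:

```python
# Sketch: evaluating the RS-Entry cost bound. The unit costs and the
# models for C(Test) and C(AND-Tree) are placeholder assumptions, not
# the gate-cost model of the thesis.
from math import ceil, log2

C_AND, C_OR, C_MUX, C_REG = 1, 1, 3, 8   # assumed unit costs

def c_and_tree(n):          # n-input AND tree built from 2-input gates
    return (n - 1) * C_AND

def c_test(l_rob):          # assumed: l_rob-bit tag comparator
    return l_rob * 2 + (l_rob - 1)

def c_rs_op(w, l_rob):      # cost bound of one operand circuit RS-Op
    return (2 * c_test(l_rob) + 3 * w * C_MUX + w * C_REG
            + 2 * C_OR + C_AND + C_MUX + (1 + l_rob) * C_REG)

def c_rs_entry(w_c, widths, l_rob):   # cost bound of one entry
    return (sum(c_rs_op(w, l_rob) for w in widths)
            + c_and_tree(len(widths))
            + 2 * C_AND + C_MUX + (w_c + 1) * C_REG)

# Example: two 32-bit operands, 13 control bits, 5-bit ROB tag
print(c_rs_entry(w_c=13, widths=[32, 32], l_rob=5))
```

Plugging in other operand widths (e.g. the six-operand floating point entries of table 4.2) only changes the `widths` list.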

[Figure 4.12: Single operand of a reservation station entry]

[Figure 4.13: Modified test circuit for low part operands reading double precision results]

Operands

Figure 4.12 shows the operand circuit RS-Op. The circuit saves the valid bit, the tag, and the data of the operand in the respective registers. If the fill signal is active, the content of the input bus In.⋆ is saved into the registers; otherwise the current operand is held in the entry, i.e., the output bus Out.⋆ with the updated values for the operand is latched. Similar to the RAM forwarding circuit in section 2.6.1, the sub-circuits Test compare the tags of the busses CDB.⋆ and ROB.⋆ with the tag of the operand. If the tags match, the signal CDB.full (respectively ROB.full) indicates valid forwarding data, and the operand is not yet valid, the respective forwarding data is multiplexed into the data output Out.data. The new valid signal is computed as the OR of the old valid signal and the forward signals from the two test circuits.
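The update rule of RS-Op can be summarized as a small behavioral model. The Python sketch below is illustrative only; the `Operand` class and the `(full, tag, data)` triples are modeling assumptions, not the RTL of the thesis:

```python
# Behavioral sketch of the RS-Op update rule (not RTL).
from dataclasses import dataclass

@dataclass
class Operand:
    valid: bool
    tag: int
    data: int

def forward(op, bus_full, bus_tag, bus_data):
    """Forward bus data into the operand if the tags match, the bus
    carries valid data, and the operand is not yet valid."""
    if (not op.valid) and bus_full and bus_tag == op.tag:
        return Operand(valid=True, tag=op.tag, data=bus_data)
    return op

def rs_op_update(op, cdb, rob):
    # cdb and rob are (full, tag, data) triples; both busses update the
    # operand, mirroring the two Test sub-circuits of figure 4.12.
    op = forward(op, *cdb)
    op = forward(op, *rob)
    return op

op = Operand(valid=False, tag=7, data=0)
op = rs_op_update(op, cdb=(True, 7, 0xABCD), rob=(False, 0, 0))
print(op.valid, hex(op.data))   # the CDB delivers the missing operand
```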

Assume a 32 bit instruction uses the high part of a 64 bit result as operand. This can happen if the 32 bit instruction depends on a 64 bit result and reads an odd register. 64 bit results are either written into the floating point register file or into the registers 8 and 9 of the special purpose register file (by integer multiplication / division instructions). The high part of these registers can only be used by operand 2 of the memory unit and the integer ALU, and by operands 2 and 4 of the floating point units. For these operands, the circuit Test is replaced by a modified circuit that uses the high part of the CDB respectively the ROB output if the instruction reads an odd address and depends on a 64 bit result (see figure 4.13). The signal depDbl indicates that the instruction that the operand depends on is a double precision instruction.


[Figure 4.14: Reservation station control]

The signal odd is active if the operand address is odd. These two signals are computed during decode (see section 4.1.4) and are part of the control bus con of the reservation station entries. The cost of an operand circuit with w data bits can be estimated as:

C(RS-Op(w)) ≤ 2 · C(Test(lROB)) + 3 · w · CMUX + w · CREG + 2 · COR + CAND + CMUX + (1 + lROB) · CREG.

The additional cost for the entries of a reservation station of type i (i = 1 for the integer ALU reservation station, i ∈ {3, 4, 5} for the floating point reservation stations) is:

+C(RS-Entry(o, w⋆, wC)) ≤ { 32 · CMUX + CAND          if i = 1
                            2 · (32 · CMUX + CAND)    if i ∈ {3, 4, 5}
                            0                          else }.

4.2.2 Reservation Station Control

The reservation station control circuit RS-Control (see figure 4.14) computes the control signals disp⋆ and fill⋆ for the entries, the output RS.⋆ of the reservation station to the functional unit, and the stall signal stallOut to the issue circuit.

The entry k can be filled if the entry is not full (fullk = 0) or if its content is filled into the next entry (fillk+1 = 1). The reservation station can accept new data if the first entry may be filled. Since the content of the last entry (number eRS − 1) cannot be filled into any other entry, it follows:

fillk := ⋁_{j=k}^{eRS−1} ¬fullj,    stallOut := ¬fill0.
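A behavioral sketch of this computation (illustrative Python model; the signal names follow the text):

```python
# Sketch of the fill/stall computation: entry k can be filled iff some
# entry j >= k is empty, so the whole queue can shift up by one.
def fill_signals(full):
    """full[k] is the full bit of entry k; returns (fill, stall_out)."""
    e = len(full)
    fill = [any(not full[j] for j in range(k, e)) for k in range(e)]
    stall_out = not fill[0]
    return fill, stall_out

# Entries 0 and 1 full, entry 2 empty: every entry can still be filled,
# so the issue circuit is not stalled.
fill, stall = fill_signals([True, True, False])
print(fill, stall)
```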

Note that the stall output only depends on the full bits of the entries. Hence, the reservation station splits the stall signal similarly to the buffer circuit. The drawback is that the issue circuit may be stalled even if an instruction is currently dispatched to the functional unit and therefore its entry could be filled. Taking the dispatch signals into account for the computation of the fill and stall signals would significantly increase the delay of the stall output. This could make it necessary to add a buffer circuit above the reservation station.

[Figure 4.15: Dispatch computation for memory (a) and branch checking (b) unit]

Instead, one could simply increase the size of the reservation station by one entry to get the same effect.

If the stall input stallIn from the functional unit is not active, the control circuit dispatches the oldest ready instruction in the reservation station to the functional unit. The entry k contains a valid and ready instruction if the request signal Entryk.req of this entry is active. A find-first-one circuit FFO on the request signals computes as output ffo the entry k with the highest index that contains a valid and ready instruction. Hence, the signals select⋆ computed from the output ffo select, in unary encoding, the entry containing the oldest ready instruction. The negation of the stall input stallIn is AND-ed with this output, yielding the dispatch signals disp⋆. Thus, no instruction is dispatched if the stall input is active. The output zero of the circuit FFO indicates that no entry is full and ready. If this signal is not active, a valid instruction can be sent to the functional unit. Thus, the negation of the zero output can be used as full signal for the output bus RS.⋆. The signals select⋆ are used to select from the entry outputs Entry⋆.⋆ the instruction which is sent to the functional unit via the output bus RS.⋆.

For the memory unit and the branch checking unit, small changes have to be made to the computation of the dispatch signals due to restrictions on the dispatch order. Note that due to these restrictions the DLXπ+ must have exactly one memory unit and one branch checking unit.

The reservation station of the memory unit has to guarantee that no memory instruction overtakes a store instruction, as the store may write to the same address. Note that the address of a memory access is computed inside the memory unit and is therefore not known to the memory reservation station. A memory instruction may only be dispatched if no older instruction is a store instruction.
In order to check for all entries whether an older instruction is a store instruction, a parallel prefix OR of the signals Entry⋆.store ∧ Entry⋆.full, which indicate a valid store instruction, is computed. The output of the parallel prefix OR is used to turn off the dispatch signals as computed by the find-first-one circuit (see figure 4.15 (a)). The branch checking unit can check branches only in order. Hence, the reservation station of the branch checking unit must dispatch the instructions in order.

[Figure 4.16: Full bit computation for one entry]

Only the oldest valid instruction is checked for readiness. The oldest valid instruction can be computed using a find-first-one circuit on the full bits. The output of the find-first-one circuit is AND-ed with the ready bits in order to check whether the oldest instruction is actually ready (see figure 4.15 (b)). Since only the oldest instruction may be dispatched to the branch checking unit, the signals select⋆ can be derived directly from the output of the find-first-one circuit.
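The two modified dispatch computations can be sketched behaviorally as follows (illustrative Python; the helper `ffo_select` models the find-first-one circuit, with the highest index corresponding to the oldest instruction):

```python
# Behavioral sketch of the modified dispatch computations of figure 4.15.
def ffo_select(bits):
    """Unary select of the highest-index active bit (oldest entry)."""
    sel = [False] * len(bits)
    for k in reversed(range(len(bits))):
        if bits[k]:
            sel[k] = True
            break
    return sel

def dispatch_mem(req, full, store, stall_in):
    # (a) A memory instruction may not overtake an older store: a prefix
    # OR of valid stores over the older entries masks the request bits.
    e = len(req)
    older_store = [any(full[j] and store[j] for j in range(k + 1, e))
                   for k in range(e)]
    masked = [req[k] and not older_store[k] for k in range(e)]
    return [s and not stall_in for s in ffo_select(masked)]

def dispatch_bcu(full, ready, stall_in):
    # (b) Branches are checked in order: only the oldest valid entry may
    # dispatch, and only if it is ready.
    sel = ffo_select(full)
    return [sel[k] and ready[k] and not stall_in for k in range(len(full))]

# Entry 2 holds an older store, so the ready load in entry 0 must wait;
# the store itself is dispatched.
print(dispatch_mem(req=[True, False, True], full=[True, False, True],
                   store=[False, False, True], stall_in=False))
```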

If the number of entries eRS is 1, the requirements for the memory and branch checking unit are automatically fulfilled. Thus, no special circuit is needed if eRS is 1. Additionally, the signals fill and disp cannot be active at the same time (the full bit must be inactive for fill and active for disp). This allows moving the AND gate which resets the full bit on dispatch in figure 4.11 below the multiplexer controlled by the fill signal (see figure 4.16). Hence, the delay of the path through the dispatch signal is further reduced.

The cost of the control circuit RS-Control for a reservation station of type i with o operands of width w1 to wo and wC control bits is:

C(RS-Control(eRS, o, w⋆, wC)) ≤ C(PP-OR(eRS)) + C(FFO(eRS)) + 2 · eRS · CAND
                                + (Σ_{j=1}^{o} wj + wC + 1) · C(Sel(eRS))
                                + { C(PP-OR(eRS − 1)) + eRS · CAND    if i = 0
                                    0                                  if i ≠ 0 }.

[Figure 4.17: Inserting a register after the request bit]

4.2.3 Pipelining

For an integer ALU reservation station with eRS entries, the delay of the path from the busses CDB.⋆ and ROB.⋆ to the full bit is:

D({ROB,CDB}.⋆ → Entry⋆.full′) ≤ D(Test(lROB)) + 2 · DOR     (Out.valid, figure 4.12)
                                + D(AND-Tree(2))            (ready, figure 4.11)
                                + DAND                      (req, figure 4.11)
                                + D(FFO(eRS)) + DAND        (disp⋆, figure 4.14)
                                + DAND + DMUX.

Even if the number of entries is 1 (which decreases the delay by the delay of the find-first-one circuit and the multiplexer), this computation cannot be done in one cycle for a stage depth δ of 5. Thus, the computation must be pipelined if the stage depth is too small. No pipelining registers are inserted in the last part of the computation starting at the signal req⋆. The bound for δ given by the last part of the path is acceptable. The pipelining of the path up to the signal ready⋆ can be done similarly to the forwarding circuit with stalling, i.e., using two-dimensional pipelining (see section 2.6.3). In one dimension the path from the signals {ROB,CDB}.⋆ to the signal ready⋆ is pipelined; the second pipelining dimension is the movement of the instruction through the reservation station queue.

In order to insert a pipelining register after the computation of the request signal, some modifications have to be made to the reservation station entries (see figure 4.11) to maintain correctness. The modification to the entries is depicted in figure 4.17. The request signal is reset whenever the full bit is reset due to a dispatch or a clear. Otherwise, if the full bit were reset in the last cycle due to dispatching the instruction or a clear, the request bit would still be active. This could lead to dispatching an instruction twice or to dispatching invalid data if the construction of the reservation station entry from figure 4.11 were not adjusted. Note that the extra logic as depicted in figure 4.17 introduces an additional AND gate on the path from the busses CDB.⋆ and ROB.⋆ to the full bit. Also, the clear signal is AND-ed to the full bit before the full bit is used to compute the request signal (in contrast to figure 4.11), since otherwise the delay of the path would increase by a second AND gate.
In the memory unit the computation of the dispatch signals also depends on the AND of the signals Entry⋆.store and Entry⋆.full indicating valid store instructions. A register can be inserted after the computation of this AND, similar to the register after the request bit. Thus, the minimum delay of the dispatch signals for the reservation stations of the memory and branch checking unit is only DAND greater than for the other reservation stations.

In order to compute the minimum delay of the inputs to the full registers and therefore the bound for the stage depth δ, the following assumptions are made: the stall input stallIn from the FU comes directly out of a buffer circuit, so the delay of this signal is DAND (see section 2.5.3); a register is added directly after the request signals as in figure 4.17, leading to a delay of 0 for these signals. Let i be the type of the reservation station (see table 4.1 on page 40, i.e., i = 0 for the memory and i = 6 for the branch checking reservation station). Then it must hold for a reservation station with eRS entries:

δ ≥ { 3 · DAND                                            if eRS = 1
      max{DAND, D(FFO(eRS))} + 2 · DAND + DMUX            if eRS > 1 ∧ i ∉ {0, 6}
      max{DAND, D(FFO(eRS))} + 3 · DAND + DMUX            if eRS > 1 ∧ i ∈ {0, 6} }.     (4.14)

Note that for a given δ this bounds the number of entries eRS of the reservation stations.
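Bound (4.14) can be inverted numerically to find the maximum eRS. In the sketch below the unit delays and the model D(FFO(e)) = ⌈log2 e⌉ · DAND are placeholder assumptions, not the delay model of the thesis:

```python
# Sketch: maximum number of entries permitted by bound (4.14).
# Unit delays and the FFO delay model are illustrative assumptions.
from math import ceil, log2

D_AND, D_MUX = 1, 2   # assumed unit delays

def d_ffo(e):
    return ceil(log2(e)) * D_AND if e > 1 else 0

def delta_bound(e_rs, i):
    """Right-hand side of (4.14) for a type-i reservation station."""
    if e_rs == 1:
        return 3 * D_AND
    extra = 3 * D_AND if i in (0, 6) else 2 * D_AND
    return max(D_AND, d_ffo(e_rs)) + extra + D_MUX

def max_entries(delta, i, limit=1024):
    best = 0
    for e in range(1, limit + 1):
        if delta_bound(e, i) <= delta:
            best = e
    return best

print(max_entries(delta=9, i=1))   # integer ALU reservation station
```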

Cost and Delay

To compute the number of cycles needed for dispatching instructions, two paths have to be considered. The path from the input bus In.⋆ to the register full determines the minimum number of cycles that an instruction has to stay in the reservation station if all operands of the instruction are already valid during issue. The initial value of the request signal req (i.e., the AND of the full bit and the operand valid signals) can be computed during issue without increasing the delay of the issue circuit. Thus, an issued instruction for which all operands are valid can be dispatched in the next cycle.³

The path from the busses ROB.⋆ and CDB.⋆ to the full register determines the minimum number of cycles it takes from receiving the last operand (via ROB.⋆ or CDB.⋆) to the dispatch of the instruction. In order to compute the delay of this path, the boolean variable preq is introduced, which indicates whether a pipelining register is added after the computation of the request bit. If preq is one (i.e., a register is added), the delay increases by the delay of the AND which resets the request signal as in figure 4.17.

³The additional cost of C(AND-Tree(o + 1)) is added to the cost of the reservation station.

To compute the cases where preq must be one, the stall input stallIn from the FU is again assumed to come directly out of a buffer circuit (i.e., D(stallIn) = DAND). If eRS = 1, the stall input stallIn is at least as timing critical as the request bit. Thus, in this case preq can be zero. For eRS > 1, the delay of the find-first-one circuit is at least DAND. Thus, the path from the request signal is critical. The variable preq can then be zero if the path from the signals ready⋆ to the inputs of the full registers fits into one cycle. Let i be the type of the reservation station. Then preq is:

preq := 0 if eRS = 1, or if
    δ ≥ D(FFO(eRS)) + 3 · DAND + DMUX    (i ≠ 0), respectively
    δ ≥ D(FFO(eRS)) + 4 · DAND + DMUX    (i = 0);
preq := 1 otherwise.

D({ROB,CDB}.⋆ → Entry⋆.full′) ≤ D(Test(lROB)) + 2 · DOR     (Out.valid)
                                + D(AND-Tree(o))            (ready)
                                + (preq + 1) · DAND         (req)
                                + { 0                       if i = 6
                                    D(FFO(eRS)) + DAND      if i = 0
                                    D(FFO(eRS))             else }
                                + DAND                      (disp⋆)
                                + DAND + DMUX.

If eRS = 1 then no special computation is made for the branch checking and the memory reservation station and hence preq is zero. Thus, the delay of the path is:

D({ROB,CDB}.⋆ → Entry⋆.full′) ≤ D(Test(lROB)) + 2 · DOR     (Out.valid)
                                + D(AND-Tree(o))            (ready)
                                + DAND                      (req)
                                + DAND                      (disp⋆)
                                + DAND.

The number of cycles cU2D needed from receiving the last operand from the CDB or the ROB to the dispatch of the instruction can be computed similarly to the forwarding circuit with stalling (see section 2.6.4):

cU2D = ⌈ (D({ROB,CDB}.⋆ → Entry⋆.full′) − DMUX) / (δ − DMUX) ⌉.

Now consider the output bus RS.⋆ from the reservation station to the functional unit. This bus is computed by selecting from the busses Entry⋆.⋆ using the signals select⋆ (see figure 4.14). The delay of the busses Entry⋆.⋆ and the signals select⋆ is smaller than δ due to the pipelining of the path from the busses {ROB,CDB}.⋆ to the full bits of the entries (see above). Due to the additional delay of the select circuit, the delay of the output bus RS.⋆ may be greater than δ, but the select circuit can be pipelined easily since it does not contain any loops. For simplicity this is done by adding the delay of the path that does not fit into one cycle to the delay of the functional unit. Note that if the delay of the bus RS.⋆ is smaller than δ, the delay of the functional unit can be decreased, because the functional unit can already do useful computations in the cycle in which the instruction is dispatched.

In order to compute the delay of the bus RS.⋆, the delays of the busses Entry⋆.⋆ and the signals select⋆ after pipelining of the reservation station must be computed. Let D′(Entry⋆.⋆) and D′(select⋆) denote the delays of these signals before any pipelining registers are inserted into the path from {ROB,CDB}.⋆ to the full bits of the entries. For the reservation station of the BCU (i = 6) the select signals only depend on the full bits of the entries. For the other reservation stations the select signals depend on the request signals Entry⋆.req. The delay of this signal can be derived from the formula above for the path from the busses {ROB,CDB}.⋆ to the full bits of the entries. Thus:

D′(select⋆) ≤ { D(Entry⋆.req)     if i ≠ 6
                D(Entry⋆.full)    if i = 6 }
              + D(FFO(eRS))

             ≤ { D(Test(lROB)) + 2 · DOR + D(AND-Tree(o)) + (preq + 1) · DAND    if i ≠ 6
                 0                                                               if i = 6 }
              + D(FFO(eRS)).

The delay of the bus Entry⋆.⋆ is dominated by the data of the operands. Thus, the delay of the bus is (see bus Out.data in figure 4.12):

D′(Entry⋆.⋆) ≤ D(Test(lROB)) + max{DOR + DAND, DMUX} + DMUX.

Due to the pipelining into cU2D stages, the delays of select⋆ and Entry⋆.⋆ can be reduced by up to (cU2D − 1) · (δ − DMUX). Since no pipelining registers are inserted between the request signals req⋆ and the signals select⋆, the delay of the signals select⋆ cannot get smaller than the delay contributed by the request signals. Thus, after inserting pipelining registers the delays are reduced to:

D(select⋆) ≤ max{ D(FFO(eRS)) + { (1 − preq) · DAND    if i ≠ 6
                                  0                     if i = 6 },
                  D′(select⋆) − (cU2D − 1) · (δ − DMUX) },

D(Entry⋆.⋆) ≤ max{0, D′(Entry⋆.⋆) − (cU2D − 1) · (δ − DMUX)}.

If eRS = 1 no select signals are needed to compute the output bus RS.⋆. Thus, the delay added to the functional unit of the reservation station is:

+D(FU(eRS)) ≤ { D(Entry⋆.⋆) − δ                        if eRS = 1
                max{D(select⋆), D(Entry⋆.⋆)} − δ        if eRS > 1 }.

Note that if this value is negative, the delay of the functional unit is effectively reduced, i.e., logic from the functional unit is pulled into the last cycle of dispatch. The delay of the output stall signal stallOut is:

D(stallOut) ≤ D(PP-OR(eRS)).
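The computation of cU2D and of the delay added to the functional unit can be sketched numerically; the unit delays and the models for D(Test), D(FFO), and D(AND-Tree) below are illustrative assumptions only:

```python
# Sketch: evaluating c_U2D and the delay added to the functional unit.
# All unit delays and delay models are placeholder assumptions.
from math import ceil, log2

D_AND = D_OR = 1
D_MUX = 2

def d_ffo(e):
    return ceil(log2(e)) * D_AND if e > 1 else 0

def d_test(l_rob):              # assumed tag-compare delay
    return 2 + ceil(log2(l_rob))

def c_u2d(d_path, delta):
    """Pipeline stages for the path from {ROB,CDB}.* to the full bit."""
    return ceil((d_path - D_MUX) / (delta - D_MUX))

def added_fu_delay(d_select, d_entry, e_rs, delta):
    """Delay added to (or, if negative, removed from) the FU."""
    if e_rs == 1:
        return d_entry - delta
    return max(d_select, d_entry) - delta

delta, l_rob, e_rs, o = 9, 5, 8, 2
d_path = (d_test(l_rob) + 2 * D_OR        # Test and Out.valid
          + ceil(log2(o)) * D_AND         # D(AND-Tree(o)), ready
          + 2 * D_AND                     # (preq + 1) * D_AND, preq = 1 assumed
          + d_ffo(e_rs)                   # find-first-one
          + D_AND                         # disp
          + D_AND + D_MUX)                # full bit update
print(c_u2d(d_path, delta))
```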

In order to compute the maximum number of entries eRS for a given δ, the delay of the stall input stallIn was assumed to be DOR. For a given eRS, the delay of the stall input stallIn from a functional unit of type i is bounded by:

D(stallIn) ≤ δ − { 2 · DAND            if eRS = 1
                   2 · DAND + DMUX     if eRS > 1 }.     (4.15)

The number of inputs respectively outputs of an entry with o operands of widths w1 to wo and wC control signals is 2 · Σ_{j=1}^{o} wj + o · (lROB + 1) + 33 + lROB + wC respectively Σ_{j=1}^{o} wj + 3 + wC. Thus, the cost of the entries increases through pipelining by:

+C(RS-Entry(o, w⋆, wC)) ≤ (cU2D − 1) · (CMUX + CREG) · ⌈(3 · Σ_{j=1}^{o} wj + (o + 1) · (lROB + 1) + 2 · wC + 35)/2⌉.

The cost of a reservation station with eRS entries and o operands of widths w1 to wo is:

C(RS(eRS, o, w⋆, wC)) ≤ eRS · C(RS-Entry(o, w⋆, wC)) + C(RS-Control(eRS, o, w⋆, wC)) + C(AND-Tree(o + 1)).

Let ni be the number of functional units and eRSi the number of entries of the reservation stations of type i ∈ {0, ..., 6}. Using the widths of the control signals as approximated in section 4.1.6, the total cost of the dispatch phase is:

C(Dispatch) ≤ C(RS(eRS0, 2, 32, 32, 24)) + C(PP-OR(eRS0 − 1)) + eRS0 · CAND
            + n1 · C(RS(eRS1, 2, 32, 32, 8 + lROB))
            + n2 · C(RS(eRS2, 2, 32, 32, 8 + lROB))
            + n3 · C(RS(eRS3, 6, 32, 32, 32, 32, 5, 2, 8 + lROB))
            + n4 · C(RS(eRS4, 6, 32, 32, 32, 32, 5, 2, 8 + lROB))
            + n5 · C(RS(eRS5, 6, 32, 32, 32, 32, 5, 2, 8 + lROB))
            + C(RS(eRS6, 2, 32, 32, 96 + kBTB + lROB)).

Note the additional cost for the memory unit (i = 0) due to the parallel prefix OR that checks for older store instructions in the queue.

  FU        CDB.hi              CDB.lo              case
  BCU       target PC           result
  ALU       result.lo           result.lo
  IMulDiv   result.hi           result.lo
  FPU       result.hi           result.lo           dbl
            result.lo           result.lo           ¬dbl
  Mem                           effective address   dpf ∨ dmal
            result.lo           result.lo           ¬(dpf ∨ dmal)

Table 4.3: Mapping of the FU output to the CDB

4.3 Functional Units

The processor has seven different types of functional units: memory unit, integer ALU, integer multiplier / divider, three different floating point unit types (additive, multiplicative, and miscellaneous), and a branch checking unit. To keep up with the fast processor core, the fastest published additive and multiplicative floating point units are taken from [Sei99]. The delay values for these FUs are taken from [Sei03]. The miscellaneous floating point unit is not assumed to be critical and is taken from [Jac02]. Delay values are from synthesis of the Verilog description [Lei02] using Synergy [Cad97]. Pipelining of the floating point and integer units is straightforward, using buffer circuits if the stall path gets too long. These functional units are not described in detail here. The delay values and the computation of the stall signals for these units can be found in appendix D. The memory unit is described in chapter 5; the branch checking unit is described together with the instruction fetch in section 6.5.

Table 4.3 shows the mapping of the outputs of the FUs to the CDB. In order to allow the high part of the reservation station operands to be updated by 32 bit results, 32 bit results are written to both the high and the low part of the CDB. This is only needed for FUs which may write the floating point register file (ALU, Mem, and floating point FUs). If the memory unit returns an interrupt, the effective address is saved on the low part of the CDB. This allows omitting an additional exception data field in the ROB entries. The branch checking unit returns the target PC, which is needed for interrupts of type continue (see section 4.5.3), on the high part, and the result of a branch instruction on the low part.
A branch instruction produces a result if it writes to a register file entry: jump-and-link instructions write the address of the next instruction into the general purpose register file; return-from-exception instructions write the value of the special register ESR (which is used as second operand of these instructions) into the special purpose register file. Different units can produce different interrupts. For example, only the floating point units can produce IEEEf interrupts. Hence, all interrupts which cannot occur for a unit are set to zero at the output of the unit. Then all interrupt signals have defined values for all units. The outputs of the functional units have width 74 + lROB (the full bit, 64 data bits, 8 interrupt bits, a misprediction signal from the BCU, and lROB bits for the tag).

[Figure 4.18: Completion phase]

4.4 Completion

In the completion phase the functional units write their results to the CDB. If multiple functional units have results available, one unit is selected and may write its result into the CDB register. The other functional units are stalled. The content of the CDB register is written into the ROB and distributed to the reservation stations. Figure 4.18 depicts an overview of the completion phase.

For the completion phase the type of a functional unit is irrelevant. For simplicity the FUs are numbered from 0 to n − 1. The selection of the functional unit which may write to the register CDB.⋆ is done by the arbiter circuit Arbiter. This arbiter assigns the CDB to the functional units round robin, i.e., starting from the unit that wrote to the CDB in the last cycle, it selects the next functional unit whose output is full. If the output of no functional unit with a higher index is full, the search is continued from index 0. If none of the FUs has a valid result ready, the index is unchanged and the full bit of the CDB is set to 0. Let j(t) denote the index of the last FU which has written to the CDB and let n be the number of functional units. The index j(t+1) of the next FU which may write to the CDB can be computed by the following formula:

j(t+1) := { min{i | (FUi.full = 1) ∧ (i > j(t))}    if ∃i > j(t) : FUi.full = 1
           min{i | FUi.full = 1}                    if (∀i > j(t) : FUi.full = 0) ∧ ∃i : FUi.full = 1
           j(t)                                     otherwise }.     (4.16)
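A behavioral model of the round-robin selection described above (illustrative Python, not the circuit itself):

```python
# Behavioral sketch of the round-robin CDB arbitration.
def next_index(j, full):
    """j: index of the FU that wrote to the CDB last; full[i] indicates
    FU i has a result ready. Returns the next writer, or j if none."""
    n = len(full)
    higher = [i for i in range(n) if full[i] and i > j]
    if higher:
        return min(higher)                 # next FU above the last writer
    any_full = [i for i in range(n) if full[i]]
    return min(any_full) if any_full else j   # wrap around, or keep j

# FU 2 wrote last; FUs 0 and 5 have results: FU 5 goes first, then the
# search wraps around to FU 0.
print(next_index(2, [True, False, False, False, False, True]))
print(next_index(5, [True, False, False, False, False, False]))
```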

4.4.1 Arbiter

The circuit for the arbiter is shown in figure 4.19. The sub-circuit Ack computes the acknowledge signals Ack, which select the FU that may write to the CDB in unary encoding. The FUs that are not selected have to be stalled. Therefore, the negations of the acknowledge signals can be used as stall inputs to the functional units. The circuit Ack also computes the full bit of the CDB. The width of the data bus is 73 + lROB (64 data bits, 8 interrupt signals, the misprediction bit, and the instruction tag). Thus, the cost and the delay of the arbiter circuit are:

[Figure 4.19: Arbiter circuit]

[Figure 4.20: Computation of the acknowledge signals]

D(Arbiter(n)) ≤ D(Ack(n)) + D(Sel(n)),

C(Arbiter(n)) ≤ C(Ack(n)) + CAND + (74 + lROB) · C(Sel(n)).

Acknowledge computation

Figure 4.20 shows the implementation of the circuit Ack which computes the round robin acknowledge signals. It is a delay-optimized version of the circuit in [Krö99]⁴ which is based on formula (4.16). The index j(t) of the last functional unit which has written on the CDB is saved in the register last in half-unary encoding (see section 2.4), i.e., all bits of the register last with an index equal to or lower than j(t) are one, the other bits are zero. Thus, the AND gate above the rightmost find-last-one circuit FLOl in figure 4.20 forces the full bits of those FUs to zero which have an index lower than or equal to j(t). The circuit FLOl thus computes the following functions:

FLOl.flo := min{i | (FUi.full = 1) ∧ (i > j(t))} (in unary encoding),
FLOl.zero := ¬∃i : (FUi.full = 1) ∧ (i > j(t)).

D(Ack(n)) ≤ DAND + D(FLO(n)) + DMUX,     (4.17)
C(Ack(n)) ≤ 2 · C(FLOH(n)) + 2 · C(FLO(n)) + n · (2 · CAND + 2 · CMUX + CREG).

4.4.2 Pipelining

The path from the acknowledge signals to the CDB register has no loops and can therefore be pipelined easily; it is not treated in detail here. Yet the acknowledge signals Ack are used to compute the stall inputs of the functional units, and therefore their delay must be small enough that the functional units can compute their stall signals within one cycle.

In order to compute the maximum delay allowed for the circuit Ack(n), the following assumptions are made, which minimize the delay of the inputs and the requirements on the acknowledge signals: the last stages of all functional units are assumed not to generate stall signals (i.e., genStall = 0) and not to have buffer circuits. Thus, D(FUi.full) = 0 holds for all functional units i. The acknowledge signals are not computed using an AND tree. Hence, in order to be able to compute the stall signals in the functional units it must hold (see equation (2.9) in section 2.5.4):

δ − DAND ≥ D(Ack(n))
    ⇔ (by (4.17))  δ ≥ 2 · DAND + D(FLO(n)) + DMUX.     (4.18)

Since n must be at least 7, the proposed circuit cannot be used at a stage depth of 5. The delay of the circuit Ack can be reduced in two different ways: the arbiter can be replaced by a tree of arbiters, which reduces the number of inputs of the single arbiters, or the full outputs of the functional units can be pre-computed one cycle ahead, which allows computing the acknowledge signals in two cycles. In this thesis only the arbiter tree is presented.

Figure 4.21 depicts as an example an arbiter tree with two stages. The original arbiter with n inputs is divided into s arbiters with t := ⌈n/s⌉ inputs in the first stage and one arbiter with s inputs in the second stage. All stages except the first stage of the arbiter tree have buffer circuits on their input registers in order to decouple the stall signals in the tree. In contrast to the arbiter at the root of the tree, the arbiters in the upper nodes need to process a stall input from the lower stages. If the stall input signal is active, the buffer circuit cannot accept new data. Thus, none of the acknowledge signals of the arbiter may be active. The stall signal can be incorporated into the presented arbiter circuit with only small changes that do not change the delay of the arbiter (see figure 4.22). Due to the half-unary encoding, the least significant bit of the register last is always 1.

4The original arbiter in [Kro99]¨ uses a parallel prefix OR over the bus Ack to compute the half-unary encoding which increases the delay of the arbiter circuit. 4.4 Completion 67

[Figure 4.21: Arbiter tree]

[Figure 4.22: Acknowledge computation with stall input]

The least significant bit of the input of the rightmost find-last-one circuit FLOl is therefore always 0 and can be ignored. Instead, the stall input stallIn is used as least significant bit of the input of FLOl. Thus, if stallIn is active, the least significant bit of the flo output of FLOl is active. Since this bit is replaced by a 0 in figure 4.22 and the zero output of FLOl is not active, none of the acknowledge signals is active if the stall input is active. The register last is only updated if new data is written into the buffer circuit, i.e., if the following signal is active:

updlast := ¬stallIn ∧ Out.full.

To allow for a small stage depth, the arbitration circuit for only two inputs can be optimized using a binary encoding for the register last (see figure 4.23). For an optimized delay, the sub-circuit Ack(2) computes different signals for stalling the inputs and for selecting between the two data ports. The stall outputs stall⋆ are computed exploiting that the value of a stall output may be arbitrary if the corresponding input full signal is not active, since in this case the input is not stalled anyway. Also, the select signal for the data may be arbitrary if the input stall signal is active, since then the output data is not latched into the next stage anyway.

[Figure 4.23: Arbiter with stall for two inputs (with sub-circuit Ack(2))]

The delay of the stall input of the arbiter is at most DAND since it comes directly out of a buffer circuit. Thus, for the 2 input arbiter it holds (see figure 4.23):

D(Ack(2)) ≤ max{DMUX, 2 · DAND, DAND + DOR},     (4.19)
D(Arbiter(2)) ≤ 2 · DAND + DMUX,
C(Arbiter(2)) ≤ 4 · CAND + 3 · COR + (74 + lROB) · CMUX + CREG.

Assume that the last stages of the functional units cannot generate a stall signal. Then for all stages of the buffer tree the delay of the input full signals is at most DOR (see figure 2.4). The delay of the stall outputs of the arbiters to the preceding stage of the tree, respectively to the functional unit, may be at most δ − DAND (see equation (2.9) in section 2.5.4). Thus, using a tree of two-port arbiters, the bound for δ from equation (4.18) can be reduced to:

δ − DAND ≥ DOR + D(Ack(2))
    ⇔ (by (4.19))  δ ≥ DAND + DOR + max{DMUX, 2 · DAND, DAND + DOR},     (4.20)

which holds for δ ≥ 5. Note that this bound is independent of the number of functional units n.

4.4.3 Cost and Delay

In order to compute the number of stages of the arbiter tree of the completion phase, two variables are introduced: the variable tL denotes the maximum number of inputs of the arbiters at the leaves of the tree, tI the maximum number of inputs of the inner nodes of the tree. For the computation of tL it is assumed that the last stages of the functional units do not generate a stall and do not have buffer circuits. Thus, the delay of the full signals of the inputs is zero and the delay of the outputs may be at most δ − DAND. Using equations (4.17) and (4.19), tL can be computed as:

tL = max{t | δ − DAND ≥ D(Ack(t))}
   = max{t | δ ≥ D(FLO(t)) + DMUX + 2 · DAND                     if t > 2
             δ ≥ DAND + max{DMUX, 2 · DAND, DAND + DOR}          if t = 2}.     (4.21)
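Formula (4.21) can be evaluated numerically; in the sketch below the unit delays and the model D(FLO(t)) = ⌈log2 t⌉ · DAND are placeholder assumptions, not the delay model of the thesis:

```python
# Sketch: evaluating formula (4.21) for t_L. Unit delays and the FLO
# delay model are illustrative assumptions only.
from math import ceil, log2

D_AND = D_OR = 1
D_MUX = 2

def d_flo(t):
    return ceil(log2(t)) * D_AND

def ack_fits(t, delta):
    """Does a t-input Ack circuit meet the leaf-arbiter timing budget?"""
    if t == 2:
        return delta >= D_AND + max(D_MUX, 2 * D_AND, D_AND + D_OR)
    return delta >= d_flo(t) + D_MUX + 2 * D_AND

def t_leaves(delta, limit=64):
    return max((t for t in range(2, limit + 1) if ack_fits(t, delta)),
               default=0)

print(t_leaves(delta=6))
```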

Figure 4.24: Arbiter tree with pipeline select circuits

For the inner nodes the delay of the input full signals is DOR due to the buffer circuits between the stages (see section 2.5.3). Hence, tI can be computed as:

tI = max{t|δ − DAND ≥ DOR + D(Ack(t))}

   = max{t | δ ≥ DOR + D(FLO(t)) + DMUX + 2 · DAND              if t > 2,
             δ ≥ DOR + DAND + max{DMUX, 2 · DAND, DAND + DOR}   if t = 2}.

Then the number of arbiter stages cAT in the arbiter tree is:

cAT = ⌈log_tI ⌈n/tL⌉⌉.

The path from the acknowledge signals through the selection circuit to the data output can easily be pipelined if necessary. Note that the delay of the acknowledge computation D(Ack(n)) is at least as large as the delay of the select circuit D(Sel(n)). Thus, the output of the arbiter can be computed in at most two cycles. In an arbiter tree, the computation of the data output of stage i of the tree can be done in parallel to the arbiter of stage i + 1 of the tree. See figure 4.24 for an example with 2 stages. In the figure, the select circuits Sel(t) compute the data for stage 1 in parallel to the acknowledge computation circuit Ack(s) of stage 2. Thus, only the select circuit of the root of the arbiter tree needs an additional stage. Note that performing the selection of stage i in parallel to the arbitration of stage i delays the data inputs for the selection of stage i + 1. Thus, even if the arbiter of stage i + 1 itself would fit into one cycle, the selection must then be moved to the next stage due to the delay of the data inputs. Thus, if the outputs of any arbiter in the tree are computed in two cycles, the overall number of cycles needed for the computation of the output of the arbiter tree is the number of stages of the tree plus one. Let the boolean variable pAT be one if either the arbiters at the leaves or the arbiters at the inner nodes of the tree need two cycles to compute the data outputs. For the arbiters at the leaves of the tree the delay of the full inputs is zero. For the inner arbiters of the tree the full bits come out of a buffer circuit and therefore have delay DOR. Thus:

pAT = 1   if max{D(Arbiter(tL)), DOR + D(Arbiter(tI))} > δ,
      0   else.

The total number of stages for the completion phase cC is then:

cC = cAT + pAT .
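The width restrictions above can be evaluated numerically. The following sketch computes the maximum arbiter widths tL and tI for a given cycle time δ; the gate-delay values and the logarithmic delay model for the find-last-one circuit FLO are illustrative assumptions, since the thesis derives the exact formulas elsewhere.

```python
import math

# Illustrative gate delays; the thesis treats these as technology parameters.
D_AND, D_OR, D_MUX = 1, 1, 2

def d_flo(t):
    # Assumed logarithmic delay model for a find-last-one circuit over t inputs.
    return 2 * math.ceil(math.log2(t))

def d_ack(t):
    """Delay of Ack(t) following equations (4.19) and (4.21)."""
    if t == 2:
        return max(D_MUX, 2 * D_AND, D_AND + D_OR)
    return d_flo(t) + D_MUX + 2 * D_AND

def arbiter_widths(delta, t_max=64):
    """Maximum leaf width t_L (full inputs undelayed) and inner-node width
    t_I (full inputs delayed by D_OR) such that the stall outputs meet the
    budget delta - D_AND; assumes delta >= 5 so that a width of 2 fits."""
    t_l = max(t for t in range(2, t_max + 1) if delta - D_AND >= d_ack(t))
    t_i = max(t for t in range(2, t_max + 1) if delta - D_AND >= D_OR + d_ack(t))
    return t_l, t_i
```

Under these assumed delays the minimum cycle time δ = 5 admits only two-input arbiters, while larger cycle times admit wider arbiters and hence flatter trees.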

Due to the buffer circuits the delay of the stall inputs of the arbiters on the leaves of the arbiter tree is DAND. Hence, the stall input is at most as critical as the full input (see figures 4.22 and 4.23). Since the full inputs of these arbiters do not come out of buffer circuits it holds:

D(FU⋆.stallIn) ≤ D(Ack(tL))
               ≤ DAND + D(FLO(min{tL, n})) + DMUX     if tL > 2,
                 max{DMUX, 2 · DAND, DAND + DOR}      if tL = 2.    (4.22)

Let n be the number of functional units. The number of inputs of the completion phase is approximately n · (74 + lROB), the number of outputs is 74 + lROB. Then the cost of the completion phase can be approximated by:

C(Complete) ≤ C(Arbiter(n)) + (n + 1) · (74 + lROB) · CREG                            if tL ≥ n,
              ⌈n/tL⌉ · C(Arbiter(tL)) + Σ_{i=1}^{cAT−1} ⌈n/(tL · tI^{i−1})⌉ · C(Arbiter(tI))
                  + (n + 1) · (74 + lROB) · CREG                                       if tL < n.

4.5 Retire

During the retire phase the results of the instructions are written into the register files. In order to support branch prediction and precise interrupts, the instructions are reordered using the reorder buffer before they write the register file. If no succeeding instruction writing the same register is being processed at the time the register file is updated, the retiring instruction sets the valid bit of the producer table entry of the destination register.

4.5.1 Overview

An overview of the retire phase is depicted in figure 4.25. The retire phase is divided into three sub-phases Ret1, Ret2, and Ret3 such that in each sub-phase only one RAM access is made. The following sections describe the three sub-phases.

Figure 4.25: Retire phase

Sub-phase Ret1

The first sub-phase Ret1 reads the oldest instruction currently in the ROB. If this instruction has already completed, i.e., the valid bit of the entry is set, the instruction is retired. The hardware for the sub-phase Ret1 is presented in the ROB section 4.6.

Sub-phase Ret2

In the second sub-phase Ret2 the producer table is checked for whether any succeeding instruction currently being processed writes to the same destination register. Only if this is not the case may the valid bit of the producer table entry of the destination register be set when the destination register is written; otherwise the producer table entry must stay unchanged. In order to check whether the instruction being retired is the last one to write a register, the sub-phase Ret2 reads the producer table PT entry of the destination register of the retiring instruction and then compares the tag of that entry with the tag of the instruction in the circuit TagCheck. Since every instruction writes its tag to the producer table entry of its destination register during decode, no other instruction currently being processed writes its result to the same register if and only if the tags match. If the tags do not match, the write signal for the producer table is disabled.

In parallel to this check the sub-phase Ret2 computes all remaining signals which are needed for retiring the instruction. In particular, these are the new value for the floating point flag register IEEEf and the interrupt bus JISR.⋆. For this the sub-phase Ret2 accesses the special purpose register file, which delivers the values of the special registers IEEEf and SR for the time the instruction enters the sub-phase Ret3 (see section 4.7.4). The register SR is then used to compute the interrupt bus.

Sub-phase Ret3

In the sub-phase Ret3 the producer tables and the register files are updated. If an interrupt or a branch misprediction has occurred, the processor is flushed by activating the signal clear. The signal clear resets the full bits of all stages as well as the producer tables and the ROB. Since retire is done in order, this leaves the processor in a consistent state. The computation of the clear signal depends on the handling of branch mispredictions and is therefore presented in more detail in the instruction fetch chapter 6. The sub-phase Ret3 only consists of these RAM accesses and is presented in the sections for the register file and the producer table (4.7 and 4.8).

5 The producer tables and the SPR are accessed twice during retire (the SPR is accessed the second time as part of the register files). Both RAMs are depicted twice in figure 4.25 in order to emphasize the independent accesses.

4.5.2 Tag Check

The circuit TagCheck checks for each register file R ∈ {GPR, FPR, SPR} whether the tag of the destination register PT.R.tag of the instruction equals the tag ROB.tag of the instruction read from the ROB. If the tags are equal and the write signal for the register file type ROB.D.R.write is active, the valid bit of the destination register entry must be set by writing to the producer table. Thus, the write signals for the producer tables are computed as follows:

D.PT.R.write := D.R.write ∧ (ROB.tag = PT.R.tag).
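The tag comparison above amounts to a last-writer check, which can be sketched as:

```python
def pt_write_enable(d_write, rob_tag, pt_tag):
    """Sketch of the TagCheck write enable for one register file type.

    The producer-table valid bit may only be set if the retiring
    instruction is still the newest writer of its destination register,
    i.e., if its tag still matches the producer table entry.
    """
    return d_write and rob_tag == pt_tag
```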

Let I be an instruction in the retire sub-phase Ret2. The output PT.R.tag of the read access to the producer table for I must contain the value of the entry at the time the instruction I enters the sub-phase Ret3. Otherwise it could happen that a new instruction In which updates the producer table during its decode phase is not recognized. The updated value of In could then be overwritten by the instruction I in sub-phase Ret3, which may lead to data inconsistencies. Hence, all write accesses to the PT during decode that start before I enters the sub-phase Ret3 must be forwarded to the read access in the retire sub-phase Ret2. The forwarding of the producer table is described in detail in section 4.8. The cost and delay of the circuit TagCheck are:

C(TagCheck) ≤ 3 · C(EQ(lROB + 1)),

D(TagCheck) ≤ D(EQ(lROB + 1)).

4.5.3 Interrupt Handling

The circuit JISR computes all interrupt related signals. If an interrupt occurs, the processor is flushed and instruction fetch is restarted at the start of the interrupt service routine (ISR). The service routine then executes code to react to the interrupt (it “handles” the interrupt). The last instruction of the ISR is always a return-from-exception (rfe) instruction which returns to the code that caused the interrupt. In order to allow precise interrupt handling, some registers of the SPR defining the address and type of the interrupt must be updated when an interrupt occurs. A detailed description of the interrupt handling including a correctness proof can be found in [MP00].

The supported interrupts are shown in table 4.4, ordered by priority. If two different interrupts occur for one instruction, the interrupt with the higher priority (lower index) is handled first. The internal interrupts (priority 1 to 12) are detected in the phases instruction fetch, decode, and execute. The corresponding signals are collected and saved in the ROB. The external interrupts are assigned to the first instruction which enters the retire sub-phase Ret2 after the interrupt occurred. As in [MP00], the external interrupt signals are required to remain active until the processor is flushed.

name                   signal  priority  type      maskable  external
reset                  reset   0         abort     no        yes
illegal instruction    ill     1         abort     no        no
misaligned access      mal     2         abort     no        no
page fault IM          Ipf     3         repeat    no        no
page fault DM          Dpf     4         repeat    no        no
trap                   trap    5         continue  no        no
FXU overflow           ovf     6         continue  yes       no
FPU overflow           fOVF    7         continue  yes       no
FPU underflow          fUNF    8         continue  yes       no
FPU inexact result     fINX    9         continue  yes       no
FPU divide by zero     fDBZ    10        continue  yes       no
FPU invalid operation  fINV    11        continue  yes       no
FPU unimplemented      uFOP    12        continue  no        no
external I/O           ex[j]   12 + j    continue  yes       yes

Table 4.4: Interrupts

The internal and external interrupts are combined in the bus CA[31 : 0] according to their priority:

CA[i] := pup                   if i = 0
         ROB.ill               if i = 1
         ROB.dmal ∨ ROB.imal   if i = 2
         ROB.Ipf               if i = 3
         ROB.Dpf               if i = 4
         ROB.trap              if i = 5
         ROB.ovf               if i = 6
         ROB.IEEEf[i − 7]      if 7 ≤ i ≤ 11
         ROB.uFOP              if i = 12
         ex[i − 13]            if i ≥ 13

Misaligned memory accesses can occur during instruction fetch (imal) or during data memory accesses (dmal). Both interrupts are combined into the misaligned interrupt signal CA[2]. The interrupts 6 to 11 and 13 to 31 can be masked by the mask register SR. These interrupts are ignored if the corresponding bit of the register SR is not set. The masked interrupt bus MCA is defined as:

MCA[i] := CA[i] ∧ SR[i]   if 6 ≤ i ≤ 11 ∨ 13 ≤ i
          CA[i]           if i ≤ 5 ∨ i = 12

The signal jisr indicates that a non-masked interrupt has occurred:

jisr := ROB.full ∧ ⋁_{i=0}^{31} MCA[i].
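The masking and collection of the interrupt cause bits can be sketched as follows; the list-of-booleans representation of the buses CA, SR, and MCA is an assumption of this sketch.

```python
def masked_ca(ca, sr):
    """Compute MCA from CA: bits 6..11 and 13..31 are maskable by the
    mask register SR, the remaining bits (0..5 and 12) are not."""
    return [ca[i] and sr[i] if (6 <= i <= 11 or i >= 13) else ca[i]
            for i in range(32)]

def jisr(rob_full, mca):
    # a non-masked interrupt occurred for a valid (full) ROB output
    return rob_full and any(mca)
```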

An interrupt can be of one of the types abort, repeat, or continue. If an abort interrupt occurs, the processor is restarted. If the interrupt is of type repeat, the instruction that caused the interrupt has to be repeated after the interrupt has been handled by the ISR. If the interrupt is of type continue, the processor continues at the succeeding instruction after the execution of the ISR. Since the content of the register files is irrelevant if the interrupt is of type abort, the processor core distinguishes only between the types repeat and continue:

repeat := MCA[3] ∨ MCA[4] = CA[3] ∨ CA[4].

Repeat interrupts have higher priority than continue interrupts and are not maskable. Thus, the interrupt with the highest priority cannot be of type continue if repeat is active. The PC of the instruction which must be executed after the interrupt has been handled is stored in the special register ePC. For abort interrupts the value written into this register may be arbitrary. If the interrupt is of type repeat, this is the PC of the instruction that caused the interrupt. If the interrupt is of type continue, this is the PC of the next instruction. The PC of the next instruction depends on the type of the instruction for which the interrupt occurred. If the instruction is a branch instruction (indicated by the signal branch in the ROB entry), the branch target has been computed in the BCU, which delivers the target on the high part of the result bus (see table 4.3). Otherwise the PC of the next instruction is the PC of the current instruction plus 4. The special register ePC is updated using the bus ePC, which is therefore defined by:

ePC := ROB.data.hi   if ¬repeat ∧ ROB.branch
       ROB.PC + 4    if ¬repeat ∧ ¬ROB.branch
       ROB.PC        if repeat

If the interrupt is caused by a trap instruction, the register eData must be updated with the immediate constant of the instruction. If the interrupt is caused by a page fault or a misaligned memory access, the register eData must be updated with the effective address of the memory access. In any case the value that has to be written into the register eData can be found in the low part of the result bus ROB.data, as explained in the following.6

For instruction memory interrupts, the decode circuit sets the immediate constant to the PC of the instruction, i.e., the effective address of the memory access (see section 4.1.3). Thus for both trap instructions and instruction memory interrupts the register eData must be set to the value of the immediate constant. In both cases the instruction causing the interrupt uses the integer ALU, which then copies the immediate constant to the low part of the result bus. If a data memory interrupt occurs, the memory unit also stores the effective address on the low part of the result (see table 4.3). The special register eData is updated using the bus eData. Thus:

eData := ROB.data.lo.
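The selection of the ePC value can be sketched as follows; the signal names follow the text, while the 32-bit wraparound of PC + 4 is an assumption of this sketch.

```python
def compute_epc(repeat, branch, data_hi, pc):
    """Select the value of the ePC bus for a retiring instruction."""
    if repeat:
        return pc                    # repeat: re-execute the faulting instruction
    if branch:
        return data_hi               # continue + branch: BCU target from result bus
    return (pc + 4) & 0xFFFFFFFF     # continue: next sequential instruction
```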

6 Thus, the ROB entries of the DLXπ+ do not need an extra 32 bit field for the exception data as the designs proposed in [Krö99] and [Hil00] do.

Before the processor jumps into the ISR, all instructions preceding the instruction that caused the interrupt must have updated the register files. This is guaranteed as retiring is done in order. The instruction causing the interrupt may only update the register files if it is not of type repeat. Thus, the write signal for a register file R ∈ {GPR,FPR,SPR} may not be active for interrupts of type repeat:

D.RF.R.write := D.R.write ∧ ¬repeat.

The bus JISR.⋆ updates the instruction fetch unit. Its full bit must be active if an interrupt occurred; it forces the instruction fetch unit to continue instruction fetch at the start of the ISR, i.e., at the address JISR.sisr, which is set to the constant SISR (start of the interrupt service routine):

JISR.full := jisr, JISR.sisr := SISR.

Stalling

The last stage of the retire sub-phase Ret2 has to be stalled if an interrupt occurs and the instruction fetch unit cannot accept new data, which is indicated by the signal IF.lastcycle (see section 6.1.2). If the retire sub-phase Ret2 is divided into multiple stages, the stalling only affects the last stage. The other stages are never stalled: since the interrupt flushes the whole processor core anyway, inconsistent data for the succeeding instructions does not affect the correctness. Let cRet2 be the number of stages of the retire sub-phase Ret2. Then:

Ret2.stall^i := 0 for i ∈ {0, . . . , cRet2 − 2},
Ret2.stall^(cRet2−1) := JISR.full ∧ IF.lastcycle.

Cost and Delay

The cost and delay of the circuit JISR are:

C(JISR) ≤ C(OR-Tree(32)) + C(INC(32)) + 32 · C(Sel(3)) + 30 · CAND + COR,

D(JISR) ≤ max{D(INC(32)), 2 · DAND + D(OR-Tree(32))}.

4.5.4 Cost and Delay

Sub-phase Ret1

The number of cycles needed to read out the ROB in the sub-phase Ret1 has no impact on the performance measurements. The minimum number of cycles between the completion and the retiring of an instruction is defined by the number of cycles cC2R it takes to forward the data on the CDB to the output of the ROB in the read access of sub-phase Ret1. The value of cC2R is computed in the ROB section 4.6. The retire sub-phase Ret1 needs no additional hardware apart from the ROB.

Sub-phase Ret2

The total delay and the number of stages needed for the sub-phase Ret2 are:

D(Ret2) ≤ max{D(PT.⋆.tag) + D(TagCheck), D(SPR.newSR) + D(JISR), D(SPR.newIEEEf)},

cRet2 ≤ ⌈D(Ret2)/δ⌉.

The inputs to the retire sub-phase Ret2 are the outputs of the ROB and the producer table and the external interrupts. Thus, the number of inputs is 140. The outputs of the retire sub-phase Ret2 are the interrupt bus JISR.⋆, the destination register bus D.⋆, and the interrupt signals for the special purpose register file. The number of output bits is 264. The cost of the sub-phase is:

C(Ret2) ≤ C(TagCheck)+ C(JISR) + 5 · COR + cRet2 · 182 · CREG.

Sub-phase Ret3

The number of stages of the retire sub-phase Ret3 has no impact on the performance of the processor. The cost of the sub-phase Ret3 is the cost of the input registers:

C(Ret3) ≤ 264 · CREG.

4.6 Reorder Buffer Environment

The reorder buffer is a queue used to rearrange instructions into program order before they are retired. This is needed for precise interrupt handling [SP85]. New entries in the ROB are allocated during decode. When an instruction completes, the valid bit of the entry is set. If the oldest instruction in the ROB is valid, it is retired.

4.6.1 Overview

The ROB is implemented as a RAM block with a head and a tail pointer. The head pointer addresses the oldest entry; the tail pointer points to the entry which should be filled next with a new instruction. When an instruction retires, the head pointer is incremented; when a new entry is allocated, the tail pointer is incremented. If the ROB is full, no new instructions are decoded (see section 4.1.7). Thus, new entries are only allocated if the ROB is not full. The ROB is read in two different contexts and written in two further contexts. In order to distinguish these four contexts, a name is introduced for each context. The four different accesses to the ROB are listed in the following:

Allocation-context: During the decode sub-phase D2 (see section 4.1.1) a ROB entry is allocated for the new instruction by writing to the ROB. The address of the entry which is allocated is given by the tail pointer. This access resets the valid bit of the entry to indicate that the instruction has not yet completed and writes the information for the instruction which is known during decode (e.g., the address of the destination register) into the ROB.

group     name           width  purpose
valid     valid          1      valid signal for entry
dataLo    data[31 : 0]   32     lower 32 bit of result
dataHi    data[63 : 32]  32     upper 32 bit of result
onIssue   ill            1      illegal instruction
          imal           1      misaligned IMem access
          ipf            1      IMem page fault
          trap           1      trap instruction
          uFOP           1      unimplemented FP instruction
          D.addr         5      destination address
          D.dbl          1      double precision result
          D.GPR.write    1      GPR destination
          D.FPR.write    1      FPR destination
          D.SPR.write    1      SPR destination
          branch         1      branch instruction
          writeIEEEf     1      instruction writes IEEEf register
          PC             32     instruction PC
onCompl   dmal           1      misaligned DMem access
          dpf            1      DMem page fault
          ovf            1      ALU overflow
          IEEEf          5      IEEE flags
          mp             1      misprediction

Table 4.5: Components of an ROB entry

Operand-read-context: In parallel to the allocation-context the ROB is read for each of the (up to six) operands of the new instruction. If the entry read for an operand has already completed, the result can be forwarded to the reservation station.

Completion-context: During completion the result of the instruction is written into the ROB. This access also sets the valid bit of the ROB entry.

Retiring-context: In the retire sub-phase Ret1 the ROB entry addressed by the head pointer is read in order to retire the oldest instruction. If the valid bit of the read instruction is set, the instruction is retired.
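The queue discipline of the four contexts can be modeled with a minimal sketch; the entry contents are simplified to a completed flag plus a payload, and the method names are illustrative.

```python
class ROB:
    """Minimal model of the ROB as a circular queue with head/tail pointers."""

    def __init__(self, size):
        self.size = size
        self.entries = [None] * size
        self.head = 0            # oldest entry
        self.tail = 0            # next entry to allocate
        self.count = 0

    def full(self):
        return self.count == self.size

    def allocate(self, payload):        # allocation-context (decode)
        assert not self.full()          # decode stalls while the ROB is full
        tag = self.tail
        self.entries[tag] = {"valid": False, "payload": payload}
        self.tail = (self.tail + 1) % self.size
        self.count += 1
        return tag

    def complete(self, tag, result):    # completion-context
        self.entries[tag]["payload"] = result
        self.entries[tag]["valid"] = True

    def retire(self):                   # retiring-context (in order)
        e = self.entries[self.head]
        if self.count == 0 or not e or not e["valid"]:
            return None                 # oldest instruction has not completed yet
        self.head = (self.head + 1) % self.size
        self.count -= 1
        return e["payload"]
```

Even if a younger instruction completes first, the model only retires in program order, which is the property needed for precise interrupts.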

Table 4.5 lists the fields of the ROB entries. Not all fields are used in every context. The components which are used in the same contexts are combined into the following groups, as proposed in [Hil00]:

valid: This group contains the valid bit of the entry. This bit indicates that the instruction which is saved in this entry has already completed and hence the ROB entry contains the valid result of this instruction. It is accessed in every context.

dataHi, dataLo: The two groups dataHi and dataLo contain the high respectively low part of the result. Both groups are read in the operand-read-context and the retiring-context and written in the completion-context. Note that data needed by the high parts of the operands OP1 and OP2 can only be found in the group dataHi (see section 4.1.2). The low part of the operands OP1 and OP2 can only be found in the group dataLo. The operands OP3 and OP4 can only be written by 32 bit results and can therefore be found in both the groups dataHi and dataLo (see section 4.3). In order to distribute the read ports evenly, for the operand OP3 the group dataLo is read and for the operand OP4 the group dataHi is read. Therefore, only three read ports are needed for the operand-read-context.

onIssue: All information about the instruction which does not change after the decode phase is combined in the group onIssue. This comprises the interrupt conditions which are detected during instruction fetch and decode, the information about the destination register, the PC of the instruction, the branch bit indicating a branch instruction (i.e., a branch or a jump), and the bit writeIEEEf indicating a moveI2S instruction which writes the special register IEEEf. The group onIssue is written in the allocation-context and read in the retiring-context.

onCompl: This group contains the interrupt conditions which are detected during execute and the branch misprediction signal. Since this information is not needed by succeeding instructions, this group is not accessed in the operand-read-context. Thus, the group is only accessed in the completion- and retiring-contexts.

group     allocation  operand-read  completion  retiring  width
valid     1W          6R            1W          1R        1
dataLo                3R            1W          1R        32
dataHi                3R            1W          1R        32
onIssue   1W                                    1R        48
onCompl                             1W          1R        9

Table 4.6: Data width and number of ports of the ROB groups

Table 4.6 shows the width and the number of ports needed by the groups of the ROB. The valid group needs seven read ports. To reduce the number of read ports, three copies of the RAM block for the valid group are used with the same write ports but fewer read ports. The first two RAM blocks with 3 read ports each correspond to the RAM blocks of the groups dataLo and dataHi and are accessed in the same contexts. The last RAM block has only one read port and is accessed in the retiring-context. In order for all sub-groups to have the same content, every sub-group has the same two write ports. Since the valid group consists of only one bit, the additional cost for the RAM blocks is not significant.

4.6.2 Pipelining of the Retiring-Context

Pipelining of the ROB access in the retiring-context must be done differently from all other RAM accesses as the decision which address has to be read next depends on the last read result.

Let cRet1 be the number of cycles needed for a read access to the ROB in the retiring-context. In [Krö99] the read access for the oldest instruction is restarted if the valid bit of the oldest instruction is not set. Then no instruction can be retired for the next cRet1 − 1 cycles. This degrades the performance if cRet1 is larger than one. In order to avoid a restart of the read access, the read access of the oldest instruction I is stalled in the last stage of the ROB if I has not completed yet. Upon completion of I the result is forwarded to this read access. The forwarding also sets the valid bit of the data read from the ROB and the instruction can retire.

Since a read access to the ROB takes cRet1 cycles, the read access for the second-oldest instruction must already have been started when the oldest instruction is retired. Otherwise instructions could only be retired every cRet1 cycles. Let t be the tag of the oldest instruction in the ROB. In order to be able to retire an instruction every cycle, the read access for the entry t + 1 must be in the second-last stage of the read access, the access for entry t + 2 in the third-last, and so on. If the processor is flushed, the oldest instruction in the ROB will have the tag 0. Thus, upon activation of the clear signal the pipeline of the read access is set up such that the last stage contains a read access for the entry zero, the second-last stage a read access for the entry one, and so on. Thus, the first stage of the read access in the retiring-context must contain a read access for the entry cRet1 − 1. The address of the first stage of the read access is determined by the head pointer. Thus, the head pointer does not point to the oldest instruction, but to the (cRet1 − 1)-oldest instruction. If the oldest instruction is retired, all read accesses move to the next stage and the head pointer is incremented.
Hence, if t is the tag of the oldest instruction in the ROB, the stages 0 to cRet1 − 1 of the read access in the retire-context always contain read accesses to the entries t + cRet1 − 1 to t. Note that the read accesses in the retire context are started speculatively, i.e., a read access can be started even before the entry has been allocated for an instruction. Thus, the RAM will not return any valid data for the read access. In this case all information of the instruction must be returned by means of forwarding.
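The addressing of the pipelined, speculative retire read can be sketched as follows; the stage numbering (stage 0 is the first stage, addressed by the head pointer) is an assumption of this sketch.

```python
def retire_read_addresses(head, c_ret1, l_rob):
    """ROB entries read by stages 0 .. c_ret1-1 of the retiring-context.

    The head pointer addresses the (c_ret1-1)-oldest entry, so stage i
    holds the read for entry head - i (mod L_ROB); the last stage then
    reads the oldest entry and one instruction can retire per cycle.
    """
    return [(head - i) % l_rob for i in range(c_ret1)]
```

After a flush the pipeline is set up with head = c_ret1 - 1, so the stages read the entries c_ret1 - 1 down to 0.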

4.6.3 Forwarding

In this section the forwarding of the write accesses in the allocation- and completion-contexts to the read accesses in the operand-read- and retiring-contexts is described. Each of the four possible combinations is discussed separately.

Allocation-Context to Operand-Read-Context

Only the valid bit of the ROB entries is written in the allocation-context and read in the operand-read-context. The allocation-context disables the valid bit. Hence, forwarding of this bit would mean that the entry of an instruction that has already retired is invalidated. Yet, the register file and producer table environments use the fact that the results of instructions that have been retired can still be read out of the ROB. This allows the forwarding in the register file and producer table environments from the retire phase to the decode phase to be omitted (see section 4.7.1). Hence, no forwarding is done from the allocation-context to the operand-read-context.

The read access in the operand-read-context can be stalled since it has to wait until the instruction is issued (see section 4.1.7). The write access in the allocation-context is not stalled. Thus, a write access can overtake a read access. If the read access uses the same address as the write access, the read access would return the data written by the write access. Thus, even if no forwarding is done, the read access could return data written by write accesses that have not been started before the read. In order to prevent this it must be guaranteed that the write accesses that can overtake a read do not use the same address as the read access.

The accesses in the allocation- and operand-read-contexts for an instruction are both started when the instruction enters the decode sub-phase D2. The read access is stalled in the first stage of D2 if the output stall signal of the issue circuit is active. The read access must not finish before the instruction is issued (see section 4.1.7). If issuing can be done in one cycle, the read access does not have to be stalled once it enters the second stage of D2. If issuing takes multiple cycles, the last stage of the ROB read access is stalled if the control signal issued is not active.
Thus, in this case all stages of the ROB access can be stalled. The write access in the allocation-context is never stalled. In fact, if the instruction is stalled in the first stage of the decode sub-phase D2, multiple write accesses will be started for the instruction. Hence, if issuing can be done in one cycle, the read access in the operand-read-context of an instruction I can be overtaken by the write access in the allocation-context of I itself. Once the read access is in the second stage no more write accesses can overtake the read. Thus, in this case it must be guaranteed that the tags of the operands of I returned by the producer table do not have the same value as the tag of I. If issuing takes multiple cycles, a read access can be stalled in every stage of the access. Hence, a read access in the operand-read-context of an instruction I can be overtaken by the write accesses in the allocation-context of all instructions that can be in the decode sub-phase D2 at the same time as I. Let cD2 be the number of cycles of the decode sub-phase D2 and let t be the tag of I. Since decode is done in order, the writes that can overtake the read access in the operand-read-context of I have the tags t + cD2 − 1 to t (modulo the size of the ROB LROB). Thus it must be guaranteed that the instruction I does not depend on any instruction with these tags.
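The check whether an operand tag falls into the window of potentially overtaking writes can be sketched with modular arithmetic:

```python
def in_overtake_window(op_tag, t, c_d2, l_rob):
    """True if op_tag lies among the tags t .. t + c_d2 - 1 (mod L_ROB)
    of write accesses that may overtake the operand read of the
    instruction with tag t; such a dependency must be excluded."""
    return (op_tag - t) % l_rob < c_d2
```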

Completion-Context to Operand-Read-Context

If not all operands of an instruction I are valid, the results of all instructions that have not yet updated the register files must be forwarded to the instruction I until all operands are valid. The read access to the ROB RAM in the operand-read-context of instruction I returns the results of all instructions which have already completed at the time the read access is started. As soon as the instruction I is in a reservation station, the forwarding is done by the reservation station by snooping on the CDB. Thus, the forwarding circuit for the ROB read access in the operand-read-context must take all instructions into account that complete in the window from the start of the read access of the instruction I to the arrival of I in the reservation station.

If issue is done in one cycle, the window consists of exactly the one cycle in which the instruction is issued. The forwarding for this cycle is done by a forwarding tree with one input. If issuing takes multiple cycles, the CDB has to be forwarded until the signal issued is active (see section 4.1.7). Since the read access can be stalled if issuing takes multiple cycles, the forwarding is done by means of a forwarding circuit with stalling.

Figure 4.26: Forwarding for the ROB access in the retiring-context with stalling

Allocation-Context to Retiring-Context

The read access in the retiring-context is done speculatively. Thus the read access in the retiring-context for an instruction I can have been started even before an entry was allocated for I in the allocation-context. Therefore, the allocation-context must be forwarded to the retiring-context in order to read, e.g., the destination address. Also, the forwarding from the allocation-context must reset the valid bit of the entry. Otherwise, if the read access in the retiring-context is started before the write in the allocation-context, the read could return a spurious valid signal. As the read access in the retiring-context can be stalled, the forwarding is done using a forwarding circuit with stalling.

Completion-Context to Retiring-Context

Due to the speculative read, the read access in the retiring-context for an instruction I can have been started before the instruction writes its result to the ROB in the comple- tion context. Thus, the write access in the completion context is also forwarded to the retiring-context by means of a forwarding circuit with stalling.

4.6.4 Implementation of Forwarding

Figure 4.26 depicts an overview of the ROB forwarding in the retiring-context. Note that all groups of the ROB RAM except the valid group are only written in either the allocation- or the completion-context. Thus, for each of these groups only one forwarding circuit is needed. The valid bit is reset in the allocation-context and set in the completion-context and therefore depends on both forwarding circuits. The output of the ROB is forced to zero when the ROB is empty (indicated by the signal ROB.empty, computed by the ROB control), resulting in the signal RAM.valid. This is done to prevent spurious valids. This output of the RAM is combined with the signals reset and set generated from the outputs forwOut of the forwarding circuits

i−1 i−1 W.{tag, valid, data} addr forwUpd dataUpdi−1 clear stalli 0 1 0 1

(cRet1 − i)bin

0 1

i i i addr stalli forw data

T est.datai Test T est.forwardi 1 0

addri forwUpdi dataUpdi Figure 4.27: Stage i of the forwarding circuit for the ROB for the allocate- respectively completion-context. These signals indicate that the last stage of the corresponding forwarding circuit contains data to be forwarded. The signal reset is active if the allocation of the entry for an instruction I is done after the read access in the retire context for I has been started. Since the allocation resets the valid bit the signal reset forces the signal RAM.valid to zero. The signal set is active if the instruction stored in the ROB entry has completed since the RAM access has been started. This signal forces the signal RAM.valid to one, resulting in the signal ROBhead.valid. The signal set has higher priority than the signal reset as the write in the completion-context for an instruction is always started after the write in the allocation-context. The signal ROBhead.valid indicates whether the oldest instruction in the ROB is valid. It is used as full input of the retire sub-phase Ret2. Note that upon activation of the clear signal the ROB access in the retiring-context is not started from the beginning, but the whole pipeline is filled with the read requests for the ROB entries cRet1−1 to 0 (see section 4.6.2). The valid output ROBhead.valid of the RAM access in the retire-context depends on the forward signals of the forward circuits, hence these signals have to be reset upon clear. Figure 4.27 depicts a modified stage of the forwarding circuit for the read access in the retiring context (see figure 2.10 on page 21 for the unmodified stage). If the signal clear is active, the stage resets the bit forw and sets the address of the access to cRet1 − i.
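The combination of the RAM output with the forwarded set and reset signals can be sketched in software as follows. This is an illustrative model only, not the hardware implementation; the signal names follow figure 4.26.

```python
def rob_head_valid(ram_out: bool, rob_empty: bool,
                   reset: bool, set_: bool) -> bool:
    """Illustrative model of the valid computation of figure 4.26.

    The RAM output is masked when the ROB is empty (giving RAM.valid),
    the forwarded allocation write (reset) clears the valid bit, and the
    forwarded completion write (set_) overrides reset, since the
    completion write is always started after the allocation write.
    """
    ram_valid = ram_out and not rob_empty
    return set_ or (ram_valid and not reset)
```

For example, a spurious valid bit left over from a previous instruction is suppressed either by the empty signal or by the forwarded allocation write.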

4.6.5 Control

Stall signals

The write accesses in the allocation- and the completion-context are never stalled. The stalling of the ROB read accesses in the operand-read-context has been discussed in section 4.1.7. The read access in the retiring-context has to be stalled if the valid bit ROBhead.valid (see figure 4.26) of the oldest instruction (which is in the last stage of this read access) is not active:

Ret1.stallIn := ¬ROBhead.valid.

If ROBhead.valid is not active, the oldest instruction is stalled in the last stage of the read access in the retiring-context until the instruction completes and the forwarding circuit for the completion-context forces the signal ROBhead.valid to one, as depicted in figure 4.26. This stall signal is used in a forwarding circuit with stalling. Thus it must hold (see equation (2.11) on page 21):

D(Ret1.stallIn) = D(ROBhead.valid) ≤ δ − DMUX. (4.23)

The signal ROBhead.valid is computed using the outputs forwOut of forwarding circuits with stalling (see figure 4.26). The outputs are based on the updated forward bits of the last stage of the forwarding circuit (see figure 4.27). Thus, the delay of the output forwOut must hold:

(4.23) D(forwOut)+ DOR + DAND ≤ D(ROBhead.valid) ≤ δ − DMUX

⇔ D(forwOut) ≤ δ − (DMUX + DOR + DAND). (4.24)

If a pipelining register is added after the test circuit in the stages of the forwarding circuit (see figure 4.27), the outputs forwUpd⋆ of the stages have delay DOR. Hence the output forwOut of the forwarding circuit has delay DOR and the equation holds true for δ = 5.
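As a sanity check of inequality (4.24), assume unit gate delays DAND = DOR = 1 and DMUX = 2. These values are an assumption, chosen to be consistent with the relation 2 · DMUX + DAND = 5 used for δ = 5 elsewhere in this chapter:

```python
# Assumed unit gate delays (not stated in the text; chosen consistently
# with 2 * D_MUX + D_AND = 5 as used for the cycle time delta = 5).
D_AND, D_OR, D_MUX = 1, 1, 2
delta = 5

# With a pipelining register after the test circuit, the output
# forwOut of the forwarding circuit has delay D_OR.
D_forwOut = D_OR

# Inequality (4.24): D(forwOut) <= delta - (D_MUX + D_OR + D_AND).
assert D_forwOut <= delta - (D_MUX + D_OR + D_AND)
```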

Head and tail pointer

The head and the tail pointer of the ROB are implemented using two counters in the circuit HeadTail (see figure 4.28). The head pointer addresses the oldest instruction in the ROB. When this instruction is retired (i.e., ROBhead.valid = 1), the head pointer is incremented. The tail pointer addresses the next free entry of the ROB. If a new instruction enters the core, the value of the tail pointer is assigned as tag to the instruction and the tail pointer is incremented. A new instruction enters the core if the first stage of the decode sub-phase D1 is full and not stalled. The incrementation of head and tail pointer is controlled by the signals headce and tailce:

headce := ROBhead.valid,
tailce := D1.full0 ∧ ¬D1.stall0
        = D1.full0 ∧ ¬(D1.full0 ∧ (D1.stallIn0 ∨ D1.genStall0))
        = D1.full0 ∧ ¬(D1.stallIn0 ∨ D1.genStall0).

If the signal clear is active, the head and the tail pointer are reset. The tail pointer is set to zero. Due to the pipelining of the ROB accesses the head pointer is set to cRet1 − 1, where cRet1 is the number of stages of the retire sub-phase Ret1 (see section 4.6.2). The delay of the computation of the new values for the head and tail pointer (see figure 4.28) is D(Inc(lROB)) + DMUX. If this cannot be computed in one cycle, the counter can easily be pipelined. Upon clear of the counters the pipelining registers must be set up such that the counters increment correctly from 0 respectively cRet1 − 1. The cost of the circuit HeadTail can be estimated as:
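The behaviour of the two counters can be sketched as follows. The class is a hypothetical software model of the circuit HeadTail of figure 4.28, not the circuit itself:

```python
class HeadTail:
    """Software sketch of the head and tail pointers of the ROB.

    On clear, the tail pointer is reset to 0 and the head pointer to
    cRet1 - 1 to account for the pipelined retire read access.
    """

    def __init__(self, l_rob: int, c_ret1: int) -> None:
        self.size = 1 << l_rob      # L_ROB = 2^l_ROB entries
        self.c_ret1 = c_ret1
        self.clear()

    def clear(self) -> None:
        self.tail = 0
        self.head = (self.c_ret1 - 1) % self.size

    def cycle(self, headce: bool, tailce: bool):
        """One clock cycle; returns the tag assigned to a new instruction."""
        tag = None
        if headce:                  # the oldest instruction retires
            self.head = (self.head + 1) % self.size
        if tailce:                  # a new instruction enters the core
            tag = self.tail         # the tail pointer value becomes its tag
            self.tail = (self.tail + 1) % self.size
        return tag
```

Both pointers wrap around modulo the ROB size, which is why tag-uniqueness has to be argued separately below.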

C(HeadTail) ≤ 2 · (C(Inc(lROB)) + COR + lROB · (CMUX + CREG)) + CAND.

Figure 4.28: Head and tail pointer computation

Full and empty bit

The ROB control computes two signals ROB.full and ROB.empty indicating that the ROB is full respectively empty. The signal ROB.full is used as stall signal in the decode phase, since new instructions may only enter the decode phase if the ROB is not full. If no instruction is in the ROB, the output of the ROB is invalid. In order to prevent an invalid instruction from retiring, the valid output of the ROB is ignored if the signal ROB.empty is active, as depicted in figure 4.26.

The computation of the full and empty bits is based on a circuit which counts the number of valid instructions in the ROB as proposed in [Lei99]. Due to the pipelining of the decode and retire phases two separate counters have to be used for the computation of the full and the empty signal.

The ROB full signal is used to guarantee tag-uniqueness, i.e., that no two instructions with the same tag are active at the same time. Therefore, the counter for the ROB full signal must take all instructions into account which have already entered the decode sub-phase D1 but have not yet written the register files, i.e., entered the retire sub-phase Ret3. These instructions are called active instructions.

The ROB empty signal is used to invalidate the ROB output if the ROB does not contain any valid instruction. Therefore, the counter for the ROB empty signal must only count the instructions which really are in the ROB, i.e., the instructions for which the write access in the allocation-context during decode has been started and the read access in the retiring-context has not yet finished. The write access in the allocation-context is started in decode sub-phase D2. The read access in the retiring-context is finished if an instruction enters the retire sub-phase Ret2.

In [Lei99] decoding or retiring an instruction only takes one cycle. Thus, instructions allocate a ROB entry in the same cycle they enter the decode phase and write the register files in the same cycle they are read out of the ROB. Therefore, the same counter can be used for the full and the empty bit in [Lei99].

Figure 4.29 depicts the circuit FullEmpty with the counters fcnt and ecnt for the computation of the ROB full signal ROB.full and the ROB empty signal ROB.empty. Instructions get active when they enter the decode sub-phase D1. This is indicated by the signal tailce. Instructions get inactive when they enter the last retire sub-phase Ret3. Since this sub-phase is never stalled, this is indicated by the full bit of the input registers Ret3.full0. Hence, the signals tailce and Ret3.full0 can be used to increment respectively decrement the counter fcnt.

Figure 4.29: Computation of the full and empty bit

An instruction is read out of the ROB if it is the oldest instruction in the ROB and has already completed. In this case the instruction enters the retire sub-phase Ret2. This is indicated by the signal headce, which thus determines whether the counter ecnt shall be decremented. A new write access in the allocation-context is started if an instruction is in the first stage of the decode sub-phase D2 and this stage is not stalled:

inROB := D2.full0 ∧ ¬D2.stall0.

Thus, the signal inROB can be used to increment the counter ecnt with the following small restriction: Let cA2R be the number of cycles needed for forwarding from the allocation-context to the retiring-context. Thus, it takes cA2R cycles to clear the valid output ROBhead.valid of the ROB using forwarding (by the signal reset, see figure 4.26). Then the signal inROB must be delayed cA2R − 1 cycles and the delayed version inROBd must be used to increment the counter ecnt in figure 4.29. Otherwise the empty signal might become inactive before the write in the allocation-context is forwarded to the retiring-context and has reset the valid output RAM.valid of the ROB. The valid bit stored in the ROB RAM could still be active from the last instruction that used the entry. Thus the signal ROBhead.valid could be active, which would lead to a spurious retirement of an invalid ROB entry.

To reduce the delay of the counters fcnt and ecnt, half-unary counters are used. This allows incrementing the counter with a 1-bit left-shifter LS(1) and decrementing it with a 1-bit right-shifter RS(1) (see figure 4.29). Since the last bit of the half-unary encoding is always zero, this bit is ignored for the counters, i.e., a 0 is represented by all bits being zero. Both counters can be cleared using the signal clear. The ROB empty signal ROB.empty must be inactive if the value represented by ecnt is at least one. Thus:

ROB.empty := { 0 if ⟨ecnt⟩hun ≥ 1
               1 else }    (4.25)
            = ¬ecnt[0].

The ROB full signal must be computed differently depending on whether issuing can be done in one cycle. Let cI be the number of cycles needed for issuing an instruction and let cD2 be the number of cycles needed for the read access to the ROB in the operand-read-context. As described in section 4.6.3, up to cD2 writes in the allocation-context can overtake a read of the ROB in the operand-read-context if issuing cannot be done in one cycle (i.e., cI > 1). Otherwise only the write access in the allocation-context of an instruction can overtake its own read access in the operand-read-context. If issuing can be done in one cycle, the ROB full signal is set to one if LROB − 1 instructions are active, otherwise it is already set to one if LROB − cD2 instructions are active. This is required by the correctness proofs of the forwarding for register file and producer table. It follows:

ROB.full := { 1 if cI = 1 ∧ ⟨fcnt⟩hun ≥ LROB − 1
              1 if cI > 1 ∧ ⟨fcnt⟩hun ≥ LROB − cD2
              0 else }    (4.26)
           = { fcnt[LROB − 2]       if cI = 1
               fcnt[LROB − 1 − cD2] if cI > 1 }.

Note that the new value of the registers ecnt and fcnt must be computed in one cycle based on the old value. Thus

δ ≥ D(LS(1)) + D(RS(1)) + DAND
  = 2 · DMUX + DAND, (4.27)

which holds for δ ≥ 5. The cost of the circuit FullEmpty is:

C(FullEmpty) ≤ 2 · LROB · (2 · CMUX + CAND + CREG)

+ (cA2R − 1) · CREG + CAND.
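The half-unary encoding and the shifter-based increment and decrement can be illustrated in software. This is a sketch under the convention used above: a value n is represented by n ones in the low-order bits, and the names are stand-ins for the counters ecnt and fcnt:

```python
L_ROB = 8  # example ROB size; bit i of a counter is one iff the value exceeds i

def hu_inc(bits):
    """Increment: a 1-bit left shift LS(1) that shifts in a one."""
    return [1] + bits[:-1]

def hu_dec(bits):
    """Decrement: a 1-bit right shift RS(1) that shifts in a zero."""
    return bits[1:] + [0]

def hu_value(bits):
    """The represented value is the number of low-order ones."""
    return bits.index(0) if 0 in bits else len(bits)

cnt = [0] * L_ROB          # cleared counter, represents 0
for _ in range(3):
    cnt = hu_inc(cnt)      # now represents 3

rob_empty = 1 - cnt[0]     # equation (4.25): ROB.empty = NOT ecnt[0]
rob_full = cnt[L_ROB - 2]  # equation (4.26) for cI = 1: fcnt[L_ROB - 2]
```

The appeal of this encoding is that both the empty and the full test reduce to reading a single bit, and increment and decrement are constant-depth shifts rather than carry chains.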

Check for oldest instruction

Instructions reading the special register IEEEf must wait at the stage cRF of the decode sub-phase D1 until all preceding instructions have retired, which is indicated by the signal allRet (see section 4.1.7). This is necessary as instructions write the register IEEEf implicitly, i.e., forwarding is not done using the Tomasulo algorithm. Figure 4.30 shows the computation of the signal allRet. Let D1.tagcRF be the tag of the waiting instruction. The instruction is the oldest instruction in the ROB if the tag ROB.tag of the instruction in the last stage of the read access in the retiring-context of the ROB matches D1.tagcRF. No instruction is in the retire sub-phase Ret2 if all full bits of Ret2 are zero. This is indicated by the signal Ret2.clean. All instructions preceding the waiting instruction have retired if the tags match and Ret2.clean is active. The cost of the computation of the signal allRet is:

C(allRet) ≤ C(EQ(lROB)) + cRet2 · CAND + 2 · CREG.

Figure 4.30: Computation of the signal allRet
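The computation of figure 4.30 amounts to a tag comparison combined with a zero test. As a hypothetical software sketch (the parameter names stand in for ROB.tag, D1.tagcRF, and the full bits of Ret2):

```python
def all_ret(rob_tag: int, d1_tag: int, ret2_full: list) -> bool:
    """Sketch of the allRet computation (figure 4.30).

    The waiting instruction is the oldest one in the ROB if its tag
    matches the tag in the last stage of the retiring-context read
    access, and no instruction is left in retire sub-phase Ret2
    (all full bits zero, i.e., Ret2.clean).
    """
    ret2_clean = not any(ret2_full)
    return rob_tag == d1_tag and ret2_clean
```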

4.6.6 Correctness

The goal of this thesis is not to formally prove the correctness of the DLXπ+. Yet “paper-and-pencil” proofs are given for the parts of the design of the DLXπ+ that differ from a Tomasulo DLX with a simple pipeline (of which the correctness is formally proven, e.g., in [Kro01]) and that are not obviously correct. These are especially the different computation of the ROB full and empty signals and the forwarding. Note that some of the lemmas proven here are needed for the correctness proofs of the forwarding of register files and producer tables in sections 4.7.1 and 4.8.1. The following definitions are used in the theorems and lemmas below.

Definition 4.1. An instruction is called “active” if it has entered the decode sub-phase D1, but has not written the register files (i.e., has not entered the retire sub-phase Ret3). An instruction is called “in the ROB” if for this instruction the write access in the allocate-context has been started, but the instruction has not yet entered the retire sub-phase Ret2. Let cA2R be the number of cycles needed for forwarding from the allocation- to the retiring-context. An instruction is called “visible in the ROB” if the ROB write access in the allocation context has been started at least cA2R − 1 cycles before, but the instruction has not yet entered the retire sub-phase Ret2. The ROB is called empty if no instruction is in the ROB.

Note that from the time an instruction is visible in the ROB, all read accesses in the retiring-context that finish will take the write access in the allocation-context into account due to forwarding. This cannot be guaranteed for instructions that are “in the ROB” but not “visible in the ROB”. Also note that every instruction that is “visible in the ROB” is also “in the ROB” and every instruction that is “in the ROB” is also “active”. If decoding and retiring is done in one cycle and cA2R = 1 (as, e.g., in [Lei99]), all three terms define the same set of instructions.

Lemma 4.2. The number of active instructions is equal to ⟨fcnt, 1⟩hun. The number of instructions that are visible in the ROB is equal to ⟨ecnt, 1⟩hun.

Proof. The lemma can be proven by induction on the number of cycles since the last cycle in which the clear signal was active.

Induction base: The clear signal is active, hence the core is cleared and no instruction is active or visible in the ROB at the end of the cycle. The counters fcnt and ecnt are set to 0^LROB due to the clear signal (see figure 4.29), which proves the induction base.

Induction step: Assume the clear signal is not active and the claim holds for the previous cycle. The counter fcnt is increased if a new instruction becomes active and it is decreased if an instruction retires. The signal inROBd is active if the allocate access for an instruction has been started exactly cA2R cycles ago. This signal increments the counter ecnt. The signal headce, which is active if an instruction enters the retire sub-phase Ret2, decrements the counter ecnt. Thus, it remains to show that the counters do not overflow or underflow to prove the induction step.

If the signal ROB.full is active, the first stage of the decode sub-phase D1 is stalled (see section 4.1.7). Thus, no instruction can become active and the counter fcnt cannot overflow. As every instruction that is visible in the ROB is also active, the value of ecnt cannot be larger than the value of fcnt and hence ecnt cannot overflow either. If ROB.empty is active, no instruction can enter the sub-phase Ret2. Thus, the counter ecnt cannot underflow. Since an instruction cannot be visible in the ROB if it is not active, the value of fcnt must be at least as large as the value of ecnt and hence fcnt cannot underflow either.

Corollary 4.3. Let cD2 be the number of cycles needed for the ROB access in the operand-read-context. If issuing is done in one cycle, at most LROB instructions are active at any time. If issuing is done in multiple cycles, at most LROB − cD2 + 1 instructions are active at any time.

Proof. If issuing is done in one cycle and LROB − 1 instructions are active, the signal ROB.full = fcnt[LROB − 2] is active (see equation (4.26)). This stalls the first stage of decode sub-phase D1. Thus, including the instruction in the first stage of decode sub-phase D1, at most LROB instructions can be active. If issuing is done in multiple cycles, the statement can be proven analogously from equation (4.26).

Corollary 4.4. No two instructions with the same tag can be active at the same time.

Proof. Assume two instructions I, I′ with the same tag are active. The lROB bit wide tail counter is incremented whenever an instruction becomes active. Thus, at least LROB − 1 = 2^lROB − 1 instructions must have become active between the two instructions I and I′. The instructions I and I′ are active and therefore have not yet retired. Since retiring is done in order, it follows that the other instructions have not yet retired either. Thus, at least LROB + 1 instructions must be active, which contradicts corollary 4.3.

Corollary 4.5. If the signal ROB.empty is not active, the ROB is not empty. Let in this case I be the oldest instruction in the ROB. Then the allocate access for I has been started at least cA2R − 1 cycles before.

Proof. If ROB.empty is not active, the signal ecnt[0] is one (see equation (4.25)). Thus, ⟨ecnt, 1⟩hun ≥ 1. It follows from lemma 4.2 that there is at least one instruction which is visible in the ROB. This instruction is therefore in the ROB and the ROB is not empty. Let I be the oldest instruction in the ROB. Since allocation is done in order, the instructions that are visible in the ROB cannot have started the ROB access in the allocation-context before I. Hence, I must also be visible in the ROB.

Lemma 4.6. Let cD2 be the number of cycles needed for the ROB access in the operand-read-context and let I be an active instruction with tag t. If issuing is done in one cycle, no older active instruction can have the tag t. If issuing is done in multiple cycles, no older active instruction can have one of the tags t to t + cD2 − 1 (modulo the size of the ROB LROB).

Proof. Let I0 be an active instruction with tag t0 and let I1 be an active instruction older than I0 with tag t1. Let d be such that t0 = t1 + d (mod LROB) for d ∈ {1, . . . , LROB}. The tail pointer assigning tags to the instructions is incremented for every instruction becoming active. Since decoding and retiring are done in order, for every j ∈ {0, . . . , d} an instruction with tag t := t1 + j (mod LROB) must be active. Hence, at least those d + 1 instructions are active. If issuing is done in one cycle, it follows from corollary 4.3 that d + 1 ≤ LROB and hence d < LROB, i.e., I1 cannot have the tag t0. If issuing is done in multiple cycles, corollary 4.3 yields d + 1 ≤ LROB − cD2 + 1 and hence d ≤ LROB − cD2, i.e., t1 cannot be one of the tags t0 to t0 + cD2 − 1 (mod LROB).

The lemma guarantees that decoding is stopped before any ROB entries that are read in the operand-read-context are invalidated by the allocation-accesses of succeeding instructions.

Theorem 4.7. Let LROB be the number of entries of the ROB and let cRet1 be the number of cycles needed for the ROB read access in the retiring-context. If LROB > cRet1 and the signal ROBhead.valid is active, then the ROB is not empty and the instruction in the last stage of the read access in the retiring-context has already completed.

Proof. Assume the signal ROBhead.valid is active. For this either the signal set or the output of the RAM must be active (see figure 4.26).

Case 1, the signal set is not active. Then the output of the ROB valid RAM must be one and the signals reset and ROB.empty must be inactive. Thus, according to corollary 4.5 the ROB is not empty. Let I be the oldest instruction in the ROB. It follows from corollary 4.5 that a write access in the allocation-context for I has been started at least cA2R − 1 cycles before. Since reset is not active, this write access must have been started before the read access in the retiring-context has been started. Therefore, the RAM block returns the correct value for the valid bit of instruction I. Since the output of the RAM is active, I must have already completed.

Case 2, the signal set is active. Let t be the address (i.e., the tag) of the last stage of the ROB access in the retiring-context. If the signal set is active, an instruction I with tag t has completed after the ROB access in the retiring-context to the entry with address t has been started. An instruction can only complete if it has been issued before. The write access in the allocation-context is started when an instruction is issued. Thus, the instruction I must already have started the write access in the allocation-context. This proves the first part of the condition of definition 4.1 for I being in the ROB.

In order to prove that I is in the ROB it remains to show that instruction I has not yet retired. From the correctness of the Tomasulo algorithm it follows that no instruction can complete twice. Since set is active, the instruction I cannot have already completed before the read access in the retiring-context with tag t has been started. Thus, the instruction I cannot have retired without setting the signal set of a read access in the retiring-context as proven in case 1. From LROB > cRet1 it follows that no two stages with the same address can exist in the retire access to the ROB. Thus, the result of the instruction I can only have been forwarded to the read which is now in the last stage of the ROB access in the retiring-context. Hence, the instruction I cannot have retired yet. Since the instruction I has not yet retired, it must be in the ROB, and the ROB is not empty. It follows that the instruction in the last stage of the retiring-context is the instruction I, which has already completed.

Note that the requirement LROB > cRet1 of the theorem holds true for reasonable values of LROB.

Theorem 4.8. If the ROB is not empty, the signal ROBhead.valid gets active eventually.

Proof. Let I be the oldest instruction in the ROB. Since the read access in the retiring-context is stalled if ROBhead.valid is inactive, the instruction I remains the oldest instruction until ROBhead.valid gets active. The liveness of the Tomasulo algorithm guarantees that the instruction I completes eventually (as proved, e.g., in [Kro01]). If the read access in the retiring-context for I has been started before I completes, the signal set gets active eventually, which forces ROBhead.valid to one. If the read access in the retiring-context for I has been started after the instruction completes, the ROB RAM for the valid bit returns a one. The write access in the allocation-context is started in parallel to the issue of an instruction. Thus, it must have been started before the instruction completes. According to corollary 4.4 no other instruction with the same tag as I can be active. Therefore, no write access in the allocation-context can have been started to the entry of instruction I and hence reset cannot be active. Since the allocation access has already been started, after at most cA2R cycles the signal ROB.empty gets inactive and hence ROBhead.valid gets active.

Cost

The total cost of the ROB control is (including the gates to compute the valid output ROBhead.valid):

C(ROB-Control) ≤ C(HeadTail)+ C(FullEmpty)+ C(allRet)

+ 2 · CAND + COR.

4.6.7 Delay Optimizations

The critical signals of the ROB control are the clock enables for the head and tail pointer, headce and tailce. In the described implementation the signals headce and tailce may have a delay of at most δ − (DMUX + DAND) as they control the counters ecnt and fcnt.

The signal headce is derived directly from the signal ROBhead.valid, which has at most the delay δ − DMUX (see equation (4.23) on page 83). In the computation of the ROB empty signal ROB.empty in figure 4.29 the AND gate for clearing the counter can be moved above the right shifter controlled by headce (see the left part of figure 4.31). Even if headce is active while clearing the counter, the value of the counter will still be set to all zeros as the right-shift only increases the number of zeros. Then the signal headce may have a delay of δ − DMUX.

Figure 4.31: Optimized computation of the full and empty bit

The signal tailce is derived from the signal D1.stallIn0 ∨ D1.genStall0 (see section 4.1.8), which is assumed to have a delay of at most δ. On the other hand the delay of the signal ROB.full was assumed to be DMUX in the computation of the stall signals for the decode phase. This allows moving the left shifter in the computation for the signal ROB.full in figure 4.29 below the register and using a delayed version tailced of the signal tailce to control the shifter (see the right part of figure 4.31). If the counter is reset, the left-shift behind the register is prevented by forcing the signal tailced to zero on clear. The AND gate does not increase the delay of the signal as

tailce ∧ ¬clear = D1.full0 ∧ ¬(D1.stallIn0 ∨ D1.genStall0) ∧ ¬clear
                = ¬(¬D1.full0 ∨ D1.stallIn0 ∨ D1.genStall0 ∨ clear)

and the OR-tree can be balanced such that the delay is the same as the delay of the signal D1.stallIn0 ∨ D1.genStall0 (see figure 4.9 on page 50). Hence, the circuit in figure 4.31 only bounds δ by the loops through ecnt and fcnt:
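The De Morgan step for gating tailce with the clear signal can be checked exhaustively; full, stall_in and gen_stall are hypothetical stand-ins for D1.full0, D1.stallIn0 and D1.genStall0:

```python
from itertools import product

# Exhaustive check that
#   tailce AND (NOT clear) == NOT(NOT full OR stall_in OR gen_stall OR clear)
# where tailce = full AND NOT (stall_in OR gen_stall).
for full, stall_in, gen_stall, clear in product((False, True), repeat=4):
    tailce = full and not (stall_in or gen_stall)
    lhs = tailce and not clear
    rhs = not ((not full) or stall_in or gen_stall or clear)
    assert lhs == rhs
```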

δ ≥ 2 · DMUX + DAND,

which holds true for δ = 5.

The signals headce and tailce are also used to control the clocking of the head and tail pointers in figure 4.28. Due to the clear logic the delay of the signals may be at most δ − DOR, which is already achieved for headce. However, to achieve this requirement for the signal tailce, the logic for clearing the tail counter must be changed (see figure 4.32). The circuit in the figure forces the output of the register tail to zero from the cycle after the clear signal has been active to the cycle where the first instruction enters the decode sub-phase D1. As a side-effect the modification also forces the output to zero whenever the input register of the decode sub-phase D1 is not full, but in this case the tail pointer may have an arbitrary value.

Figure 4.32: Optimized clearing of the tail counter

The circuit delays the signal clear by one cycle. This allows optimizing the computation of the clock enable signal of the register tail, since the signal clear resets the full bits of the decode stage, the ROB full signal ROB.full, and the signal IR.haltdec (see section 6.4). Thus, when the delayed clear signal cleard is active, the signals D1.stallIn0 and D1.genStall0 are inactive, i.e.:

cleard = 1 ⇒ D1.stallIn0 ∨ D1.genStall0 = 0. (4.28)

Therefore, the signal cleard ∨ tailce used to reset the output of the tail counter can be simplified as:

cleard ∨ tailce = cleard ∨ (D1.full0 ∧ ¬(D1.stallIn0 ∨ D1.genStall0))
                = (cleard ∨ D1.full0) ∧ (cleard ∨ ¬(D1.stallIn0 ∨ D1.genStall0))
         (4.28) = (cleard ∨ D1.full0) ∧ ¬(D1.stallIn0 ∨ D1.genStall0)
                = ¬(¬(cleard ∨ D1.full0) ∨ D1.stallIn0 ∨ D1.genStall0).
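This simplification holds only under implication (4.28). Restricting an exhaustive check to the input combinations allowed by (4.28) confirms it (again with hypothetical stand-in names for D1.full0, D1.stallIn0, D1.genStall0 and cleard):

```python
from itertools import product

# Check cleard OR tailce == NOT(NOT(cleard OR full) OR stall_in OR gen_stall)
# for all inputs satisfying implication (4.28):
#   cleard = 1  implies  stall_in OR gen_stall = 0.
for full, stall_in, gen_stall, clear_d in product((False, True), repeat=4):
    if clear_d and (stall_in or gen_stall):
        continue  # combination excluded by implication (4.28)
    tailce = full and not (stall_in or gen_stall)
    lhs = clear_d or tailce
    rhs = not ((not (clear_d or full)) or stall_in or gen_stall)
    assert lhs == rhs
```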

The OR-tree can again be balanced such that the signal cleard ∨ tailce has the same delay as the signal D1.stallIn0 ∨ D1.genStall0, which is at most δ. In order to force the signal ROB.tail to zero until the first instruction enters the decode sub-phase D1, the register tail is also cleared if the signal D1.full0 is not active. Since the stall signal for the first stage cannot get active until the first instruction enters the decode sub-phase D1, the stall signal does not have to be taken into account.

Assume the number of cycles cI needed for issuing is greater than one. Then the write access in the completion-context must be forwarded to the read access in the operand-read-context with a forwarding circuit with stalling. This requires that the stall signals D2.stalli of all stages i of the read access in the operand-read-context have at most delay δ − DMUX (see equation (2.11) on page 21). If the read access takes cD2 cycles and the stall signals are computed the usual way, it must hold for the stall signal for the second stage:

δ − DMUX ≥ D(D2.stall1) (4.10)= D(ROB.stallOut) = D(AND-Tree(cD2 + 1)).

For lROB ≥ 6 and δ = 5 it holds cD2 > 7. Then the equation does not hold.

To reduce the delay of the stall signal, pipeline bubbles are therefore only removed in the last seven stages of the ROB read access, i.e., only the full bits of the last seven stages are taken into account for computing the stall signals as proposed in the first part of section 2.5.3. Thus, for all other stages the stall signal is computed as:

stall = D2.fullcD2−6 ∧ · · · ∧ D2.fullcD2 ∧ ¬D2.issuedcD2.

The additional cost for the delay optimizations is:

+C(HeadTail) ≤ 2 · COR + CREG,
+C(FullEmpty) ≤ CAND + CREG.

4.6.8 Cost and Delay

The delay of the ROB access in the operand-read-context determines the delay of the decode sub-phase D2. The delay of the ROB access in the retiring-context determines the delay of the retire sub-phase Ret1. The delays of the ROB accesses are:

D(D2) ≤ max{D(RAM(LROB, 1, 3, 2)),D(RAM(LROB, 32, 4, 1))} + DMUX ,

D(Ret1) ≤ max{D(RAM(LROB, 1, 1, 2)) + 2 · DAND + DOR,
              max{D(RAM(LROB, 32, 4, 1)), D(RAM(LROB, 48, 1, 1)), D(RAM(LROB, 9, 1, 1))} + DMUX}.

Let cD2 and cRet1 be the number of stages of the two ROB accesses respectively the corresponding sub-phases. Define cROB to be the maximum number of cycles needed for any of the two ROB read accesses:

cD2 = ⌈D(D2)/δ⌉,

cRet1 = ⌈D(Ret1)/δ⌉,

cROB = max{cD2, cRet1}.

The number of stages for the retire sub-phase Ret1 is only needed for cost computations. The minimum number of cycles between the completion and the retirement of an instruction is bounded by the number of cycles needed for forwarding with stalling, cC2R:

cC2R = ⌈(D(Test(lROB)) + 2 · DOR + DAND − DMUX )/(δ − DMUX )⌉.
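The stage counts are simple ceilings of circuit delay over cycle time. For illustration only, with hypothetical delay values (the real values depend on the RAM and test-circuit parameters of the actual design):

```python
import math

delta = 9                       # example cycle time in gate delays
D_D2, D_Ret1 = 22, 17           # hypothetical delays of the two ROB accesses

c_D2 = math.ceil(D_D2 / delta)       # stages of the operand-read access
c_Ret1 = math.ceil(D_Ret1 / delta)   # stages of the retiring-context access
c_ROB = max(c_D2, c_Ret1)

# Forwarding with stalling, formula for c_C2R, with a hypothetical
# D(Test(l_ROB)) = 4 and gate delays D_OR = D_AND = 1, D_MUX = 2, delta = 5:
D_Test, D_OR, D_AND, D_MUX, delta_s = 4, 1, 1, 2, 5
c_C2R = math.ceil((D_Test + 2 * D_OR + D_AND - D_MUX) / (delta_s - D_MUX))
```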

The number of cycles cA2R needed to forward from the write accesses in the allocation-context is computed analogously, i.e., cA2R = cC2R. The cost of the ROB RAM blocks can be estimated as:

C(ROB-RAM) ≤ C(RAM(LROB, 1, 1, 2,cROB))

+ 2 · C(RAM(LROB, 1, 3, 2,cROB))

+ 2 · C(RAM(LROB, 32, 4, 1,cROB))

+ C(RAM(LROB, 48, 1, 1,cROB))

+ C(RAM(LROB, 9, 1, 1,cROB)). 94 Processor Core

For the retiring-context two forwarding circuits with stalling are needed. One forwarding circuit forwards the valid and the onIssue group from the allocation-context (width 48), the other circuit forwards the valid, dataLo, dataHi, and the onCompl group from the completion-context (width 74). Let cI be the number of stages of the issue circuit. If cI = 1, the forwarding circuit for the operand-read-context consists of a forwarding tree with one leaf for each operand. If cI > 1, a forwarding circuit with stalling is needed for each operand.

Let cFS(lROB) be the number of cycles needed for forwarding with stalling (with lROB address bits). Then the cost of the forwarding circuits of the ROB can be estimated as:

C(ROB-Forward) ≤ C(ForwardStall(lROB, 48, cRet1, cFS))

+ C(ForwardStall(lROB, 74, cRet1, cFS))

+ { 4 · C(Forward-Tree(lROB, 32, 1, cD2))
    + C(Forward-Tree(lROB, 5, 1, cD2))
    + C(Forward-Tree(lROB, 2, 1, cD2))      if cI = 1
    4 · C(ForwardStall(lROB, 32, cD2, cFS))
    + C(ForwardStall(lROB, 5, cD2, cFS))
    + C(ForwardStall(lROB, 2, cD2, cFS))    if cI > 1 }.

The total cost of the ROB environment is:

C(ROB) ≤ C(ROB-RAM)+ C(ROB-Control)+ C(ROB-Forward).

4.7 Register File Environment

The processor has three different types of register files: the general purpose register file GPR, the floating point register file FPR, and the special purpose register file SPR. All register files have one write port used in the retire sub-phase Ret3 to store the result and multiple read ports used in the decode sub-phase D1 to read the operands. The SPR has two additional read ports and one additional write port. The additional read ports are used in the retire sub-phase Ret2. These read ports always read the registers SR and IEEEf and therefore have constant address busses. The additional write port is used in sub-phase Ret3 to store the new value of the register IEEEf computed in the sub-phase Ret2 from the old value and the floating point exceptions of the instruction. The standard port for storing the result cannot be used to write the register IEEEf as floating point compare instructions may write two special registers, namely FCC for the comparison result and IEEEf for the exceptions. The addresses of the read accesses during decode sub-phase D1 are computed in the register file environments from the instruction register IR[31 : 0]. The encodings of the addresses in the instruction words are listed in table A.8 in appendix A. The address and write signals used for the write access are computed in the retire sub-phase Ret2. They are combined in the bus Ret.D.⋆.

Figure 4.33: Register files

4.7.1 Forwarding

No forwarding is done from the write accesses in the retire sub-phase Ret3, which write the instruction results, to the read accesses in the decode sub-phase D1, which read the operands. This is not needed for correctness, as stated by the following theorem.

Theorem 4.9. The write accesses to the register files in the retire sub-phase Ret3 need not be forwarded to the read accesses to the register files in the decode sub-phase D1.

Proof. Let ID be an active instruction in the decode phase that depends on the result of an instruction IR. If IR has written its result to the register file before ID starts its read access to the register file, no forwarding has to be done because the register file RAM will return the result of the instruction IR. Thus, assume that IR is still active (i.e., has not yet written the register file) at the time ID starts its read access to the register file. Then the producer table entry of the operand of ID that depends on IR still contains the tag of IR. Hence, the instruction ID will check whether the ROB contains the result of IR. If IR has not completed before the instruction ID starts the ROB read access in the operand-read-context, it follows from the construction of the forwarding circuit for the ROB (see section 4.6.3) that the result of IR is forwarded to ID. Thus, assume that IR has completed before ID starts the read access to the ROB. Then the ROB contains the result of IR. It remains to show that the read access to the ROB in the operand-read-context of the instruction ID returns the result of IR. At the time ID is in the first stage of decode sub-phase D1, the instruction IR has not yet written the register file, i.e., IR is active. Thus, from lemma 4.6 it follows that no instruction older than or equal to ID and younger than IR can have the same tag as IR and therefore will not overwrite the ROB entry of IR; hence the ROB entry of IR is still valid at the time ID starts the ROB read access in the operand-read-context. If issue is done in one cycle, only the write access to the ROB in the allocation-context of the instruction ID can overtake the read access to the ROB in the operand-read-context of ID, as discussed in section 4.6.3. Since ID does not have the same tag as IR, the write access in the allocation-context will not overwrite the entry of IR.
Hence, the read access returns the value of the ROB at the time the read access was started, i.e., the valid result of IR. Let cD2 be the number of cycles needed for the read access to the ROB in the operand-read-context. If issuing is done in multiple cycles, also the write accesses to the ROB in the allocation-context of the cD2 − 1 instructions following ID can overtake the read access to the ROB of ID. From lemma 4.6 it follows that these instructions

Figure 4.34: Forwarding circuit IEEEfC

cannot have the same tag as IR and therefore ID reads the valid result of IR.

IEEEf

Let I be an instruction in the retire sub-phase Ret2. As discussed in section 4.5 the special purpose register file must return the values of the special registers IEEEf and SR at the time the instruction I enters the retire sub-phase Ret3. The register files are only written in the retire sub-phase Ret3. Thus, to obtain the value of the registers at the time the instruction I enters the sub-phase Ret3, it suffices to read the registers when I enters the sub-phase Ret2 and forward the writes to these registers of all older instructions in the sub-phase Ret2. The forwarding for the register SR can be done using a standard forwarding tree (see section 2.6.1). This forwarding tree combines the updates to the register SR of the instructions in Ret2 which are older than I in parallel to I's read access to the register SR. The outputs of the forwarding tree and the register access are then combined to obtain the bus newSR containing the value of the register SR at the time the instruction I will enter the sub-phase Ret3. To forward the register IEEEf a modified forwarding tree must be used, as the register IEEEf can be written in two different ways: if the instruction which retires is a floating point operation, the new value of IEEEf is computed by OR-ing the floating point exception bits of the instruction to the old value of IEEEf [IEEE]; if the instruction explicitly writes the register IEEEf (e.g., a moveI2S instruction), the old value is overwritten. The circuit IEEEfC (see figure 4.34) combines the updates to the register IEEEf of two consecutive instructions. For this it uses the type w and data d for updating the register IEEEf of an instruction. The type w of an instruction is one if the instruction explicitly writes the register IEEEf with the data d; if w is zero, the data d is to be OR-ed to the old value. Let w1 and d1 be the type and data of an instruction I1 that succeeds an instruction I0 with type w0 and data d0.
The combined type wOut of the instructions I1 and I0 is the OR of the two types, indicating that one of the instructions I1 and I0 will directly write the register. If w1 is one, the instruction I1 will overwrite the update done by I0. Thus, in this case the combined data dOut is the data d1 of instruction I1. If w1 is zero, the data d1 will be OR-ed to the data d0 of the instruction I0. In this case, if w0 is active, the register will be overwritten with the OR of both data values, otherwise the OR of both data values will be OR-ed to the old value. Thus, updating the register IEEEf with the combined values wOut and dOut has the same effect as the sequential updates of the instructions I0 and I1.

Using the fact that the circuit IEEEfC computes an associative function (proven in the following lemma 4.10), a forwarding tree can be built with the circuit IEEEfC at the nodes. Using all instructions in the retire sub-phase Ret2 as inputs, this tree computes the combined update of these instructions to the register IEEEf. The output of the special purpose register RAM can then be combined with the output of the forwarding tree by another IEEEfC circuit, obtaining the value newIEEEf of the register at the time the youngest instruction in the tree retires. The type w and data d for every instruction can be computed as follows. A stage in the sub-phase Ret2 contains a moveI2S instruction which writes the register IEEEf, and therefore will explicitly write the register, if the stage is full and the signal ROB.writeIEEEf is active. Thus, the type w can be computed as:

w := full ∧ ROB.writeIEEEf.

If w is one, the data that has to be written into the register IEEEf is located in the lowest 5 bits of the result bus ROB.data. Otherwise the floating point exception bits ROB.IEEEf read from the ROB are used. If the stage is not full, the floating point exception bits of the stage are set to zero. Then the data d used to update the register IEEEf is:

d := ROB.D.data.lo[4 : 0]  if w,
d := ROB.IEEEf ∧ full      if ¬w.

Hence, if the stage is not full, the register IEEEf is not changed as it is OR-ed with a constant zero. In particular, if no instruction is retiring the bus newIEEEf contains the value stored in the register file.

Lemma 4.10. The circuit IEEEfC computes an associative function.

Proof. Without loss of generality let d be only one bit wide. Let (w0, d0), (w1, d1), and (w2, d2) be in {0, 1}². The function ◦ computed by the circuit IEEEfC is:

(w1 , d1) ◦ (w0 , d0) = (w1 ∨ w0 , w1 ? d1 : (d1 ∨ d0))

= (w1 ∨ w0 , w1d1 ∨ ¬w1(d1 ∨ d0)).

It holds:

((w2 , d2) ◦ (w1 , d1)) ◦ (w0 , d0)

= (w2 ∨ w1 , w2d2 ∨ ¬w2(d2 ∨ d1)) ◦ (w0 , d0)

= (w2 ∨ w1 ∨ w0 , (w2 ∨ w1)(w2d2 ∨ ¬w2(d2 ∨ d1))

∨ ¬(w2 ∨ w1)(w2d2 ∨ ¬w2(d2 ∨ d1) ∨ d0))

= (w2 ∨ w1 ∨ w0 , w2d2 ∨ w2w1d2 ∨ w2¬w2(d2 ∨ d1) ∨ ¬w2w1(d2 ∨ d1)

∨ ¬w2¬w1(w2d2 ∨ ¬w2(d2 ∨ d1) ∨ w2d0 ∨ ¬w2d0))

= (w2 ∨ w1 ∨ w0 , w2d2 ∨ ¬w2w1(d2 ∨ d1) ∨ ¬w2¬w1(d2 ∨ d1 ∨ d0))

= (w2 ∨ w1 ∨ w0 , w2d2 ∨ ¬w2(w1d2 ∨ ¬w1d2 ∨ w1d1 ∨ ¬w1(d1 ∨ d0)))

= (w2 ∨ w1 ∨ w0 , w2d2 ∨ ¬w2(d2 ∨ w1d1 ∨ ¬w1(d1 ∨ d0)))

= (w2 , d2) ◦ (w1 ∨ w0 , w1d1 ∨ ¬w1(d1 ∨ d0))

= (w2 , d2) ◦ ((w1 , d1) ◦ (w0 , d0)).
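Since the operands are Boolean, both the associativity of ◦ and the claim that the combined update has the same effect as the two sequential updates can also be checked exhaustively. A small sketch with hypothetical function names:

```python
from itertools import product

def combine(upd1, upd0):
    """Combine the IEEEf updates of a later instruction I1 = (w1, d1)
    and an earlier instruction I0 = (w0, d0) as the circuit IEEEfC does:
    the result overwrites iff either update overwrites; the combined data
    is d1 if I1 overwrites, otherwise the OR of both data values."""
    (w1, d1), (w0, d0) = upd1, upd0
    return (w1 | w0, d1 if w1 else (d1 | d0))

def apply_update(update, old):
    """Sequentially apply one update (w, d) to the old register value."""
    w, d = update
    return d if w else (old | d)

# Exhaustive check over all 1-bit data values: combine is associative and
# the combined update equals the two sequential updates.
for w2, d2, w1, d1, w0, d0, old in product((0, 1), repeat=7):
    a, b, c = (w2, d2), (w1, d1), (w0, d0)
    assert combine(combine(a, b), c) == combine(a, combine(b, c))
    assert apply_update(combine(b, c), old) == \
        apply_update(b, apply_update(c, old))
```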

Figure 4.35: General purpose register file

Cost and Delay

The cost and delay of the circuit IEEEfC are:

D(IEEEfC) ≤ DOR + DMUX ,

C(IEEEfC) ≤ 5 · CMUX + 6 · COR.

Since the delay of the forwarding trees used to compute the signals newIEEEf and newSR is only logarithmic in the number of stages of Ret2, the circuit is not assumed to be critical compared to the register file access. The delays of the outputs SPR.newIEEEf and SPR.newSR then are:

D(SPR.newIEEEf) ≤ D(SPR-RF) + D(IEEEfC),

D(SPR.newSR) ≤ D(SPR-RF) + DMUX .

Let cRF be the number of cycles needed for a register file read access and let cRet2 be the number of cycles needed for the retire sub-phase Ret2. Then the cost of the forwarding for the register files is:

C(RF-Forward) ≤ C(Forward-Tree(5, 32,cRet2,cRF ))

+ cRet2 · (5 · CMUX + 6 · CAND + C(IEEEfC))

+ cRF ·⌈(cRet2 · 12 + 6)/2⌉· CREG.

4.7.2 General Purpose Register File

The GPR consists of 32 registers, each of which is 32 bits wide. The address computation for the read accesses of the GPR is rather simple. For all instructions which access the GPR, it holds (see appendix A):

OP1.GP R.addr[4 : 0] := IR[25 : 21],

OP2.GPR.addr[4 : 0] := IR[20 : 16].

The design of the GPR is straightforward (see figure 4.35). It is a RAM block with 32 entries, each consisting of 32 bits, with two read ports and one write port. The write signal may only be active if an instruction is in the retire sub-phase Ret3. Only the low part of the data bus is used for read and write accesses. No forwarding is necessary for the GPR as stated in theorem 4.9. The write path does not influence any critical path or the number of pipeline stages. Therefore, only the delay of the read access of the GPR is taken into account. The same holds for all other register files. Hence, the delay of the GPR is estimated as:

D(GPR-RF) ≤ D(RAM(32, 32, 2, 1)).
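The address computation above amounts to simple bit slicing of the instruction word; a minimal sketch (function name is illustrative only):

```python
def gpr_operand_addrs(ir: int):
    """Extract the two GPR read addresses from a 32-bit instruction word:
    OP1.GPR.addr = IR[25:21], OP2.GPR.addr = IR[20:16]."""
    op1 = (ir >> 21) & 0x1F   # bits 25..21
    op2 = (ir >> 16) & 0x1F   # bits 20..16
    return op1, op2
```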

Figure 4.36: Floating point register file

Let cRF be the number of cycles needed for a register file access. The cost of the GPR environment is:

C(GPR-RF) ≤ C(RAM(32, 32, 2, 1,cRF )) + CAND.

4.7.3 Floating Point Register File

The FPR holds 32 single precision registers with 32 bits. The 16 pairs of even and odd registers (i.e., the pairs 0 and 1 to 30 and 31) can also be accessed as 64 bits wide double precision registers using the address of the even register. Double precision accesses with an odd address raise an illegal instruction interrupt in the decode phase. To support the two access modes, the FPR is divided into two register files of 16 entries for even and odd registers (see figure 4.36). The lowest bit of the addresses of the two read ports selects the outputs of the RAMs for the low parts of the two outputs. The high part always uses the odd RAM, as 64 bit register accesses always have even addresses. The FPR uses two separate write signals. The signal D.FPR.write.lo controls the RAM block containing the even registers, D.FPR.write.hi controls the RAM block containing the odd registers. For 64 bit results (indicated by D.dbl), the write signals for both RAM blocks must be active. For 32 bit results only the write signal for the addressed RAM block may be active. Analogously to the GPR, the write port is only active if the retire sub-phase Ret3 is full. The write signals D.FPR.write.lo and D.FPR.write.hi are computed during the sub-phase Ret2 as:

D.FPR.write.lo := D.FPR.write ∧ ¬D.addr[0],
D.FPR.write.hi := D.FPR.write ∧ (D.addr[0] ∨ D.dbl).
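A behavioral sketch of this write-enable computation. It assumes that the low data part feeds the even RAM block (so the low write enable is active for even destination addresses) and that double precision results always have an even address, as stated above; the function name is illustrative only.

```python
def fpr_write_signals(write: bool, addr0: bool, dbl: bool):
    """Compute the write enables for the even (low) and odd (high) FPR
    RAM blocks.  addr0 is bit 0 of the destination address; dbl marks a
    64-bit result, which always targets an even address."""
    write_lo = write and not addr0       # even RAM, fed by data.lo
    write_hi = write and (addr0 or dbl)  # odd RAM, fed by data.hi
    return write_lo, write_hi
```

For a double precision result (dbl with even address) both enables are active, so the pair of 16-entry RAMs is written in parallel.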

The low part of the result D.data.lo is connected to the RAM block for the even registers, the high part D.data.hi is connected to the RAM block for the odd registers. Note that in order to write 32 bit results into the register file, the results must be available on both the high and the low part of the data bus (see table 4.3 on page 63). For all instructions which read the FPR except floating point stores, the address of the first operand is IR[20 : 16] and the address of the second operand is IR[15 : 11] (see appendix A). The floating point store instruction (indicated by the signal FPstore computed in the Decode circuit) uses the bits IR[20 : 16] as second operand. Thus, the addresses of the floating point operands are:

OP1.FPR.addr := IR[20 : 16],

OP2.FPR.addr := IR[20 : 16]  if FPstore,
OP2.FPR.addr := IR[15 : 11]  if ¬FPstore.

To hide the delay of the signal FPstore, the FPR is accessed under the assumption that the instruction is not a floating point store. If the signal FPstore is active, the output for the first operand is used as output for the second operand. This is done by the multiplexer controlled by the signal FPstore in figure 4.36. The selection of the operands using the signal FPstore can be combined with the select circuit in the circuit OpGen (see section 4.1.4), which selects between the outputs of the different register files. Hence, instead of a 4 input select circuit a 5 input select circuit is used. The differences in delay and cost are added to the floating point register file. It holds:

D(Sel(5)) − D(Sel(4)) = DOR,

C(Sel(5)) − C(Sel(4)) = CAND + COR.

Hence, the delay of the last multiplexer for the computation of D1.OP1.FPR.data.lo can be replaced by DOR. The overall delay of the FPR RAM is:

D(FPR-RF) ≤ D(RAM(16, 32, 2, 1)) + DMUX + DOR.

Let cRF be the number of cycles for a register file access. The cost of the FPR including the computation of the write signals is approximated by:

C(FPR-RF) ≤ 2 · C(RAM(16, 32, 2, 1,cRF )) + 4 · CAND + COR

+ 64 · CMUX + 32 · (CAND + COR).
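The late operand select by FPstore described above can be modeled behaviorally: both ports are read speculatively with the non-store addresses, and the late FPstore signal only selects between the two already-read values. The sketch below abstracts the even/odd RAM split into one flat 32-entry array; the function name is an assumption.

```python
def fpr_read(regs, ir: int, fp_store: bool):
    """Speculatively read the FPR with the non-store operand addresses
    IR[20:16] and IR[15:11]; if the late signal FPstore is active, the
    first port's data is selected as the second operand as well."""
    op1 = regs[(ir >> 16) & 0x1F]
    op2 = regs[(ir >> 11) & 0x1F]
    return op1, op1 if fp_store else op2
```

This mirrors the design choice in the text: the RAM access never waits for FPstore; only a final (cheap) select depends on it.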

4.7.4 Special Purpose Register File

The special purpose register file consists of 10 registers which are summarized in table 4.7. The registers 0 to 4 are used for interrupt handling. The rounding mode for floating point operations is stored in register 5. Register 6 collects the floating point exception flags. The results of floating point compares are stored in register 7, which may then be used by floating point branches. The registers 8 and 9 are used to store the 64 bit wide results of integer multiplications and divisions. When an interrupt occurs, the special purpose register file must perform the following actions:

SR := 0,
ESR := SR,
EPC := Ret3.ePC,
ECA := Ret3.MCA,
EData := Ret3.eData.

addr  name   purpose
0     SR     status register (interrupt mask)
1     ESR    exception status register
2     EPC    exception PC
3     ECA    exception cause register
4     EData  exception data register
5     RM     floating point rounding mode
6     IEEEf  IEEE interrupt flags
7     FCC    floating point condition code
8     LO     LO register
9     HI     HI register

Table 4.7: Special purpose registers

Figure 4.37: Special purpose register file

The implementation of this operation is not discussed in detail here. It is assumed that a special RAM block can be used which performs these operations if the clear signal is activated. A construction for such a RAM block using discrete gates can be found, e.g., in [Krö99]. The circuit SPR is depicted in figure 4.37. It consists of one RAM block with 8 entries and two RAM blocks with one entry for the registers LO and HI (a RAM block with only one entry is basically a single register and therefore needs no address input; for the sake of description it is nevertheless treated as a RAM block). Similar to an even / odd pair of floating point registers, these two registers can be written in parallel to store the 64 bit wide result of an integer multiplication or division. The following four instruction types have SPR registers as operands (see appendix A):

• moveS2I instructions: These instructions read an arbitrary special register as second operand and save it in a general purpose register. The address of the special register is encoded in the bits [15 : 11] of the instruction word.

• return-from-exception (rfe) instructions: The rfe instruction sets the PC to the value of the special register EPC and copies the special register ESR into the special register SR. It uses the register ESR as first and EPC as second operand.

• branch on floating point condition code (BC1) instructions: The BC1 instructions are conditional branches that depend on the value of the special register FCC. This register is used as first operand of the instruction.

• floating point instructions: These instructions depend on the special registers RM and SR as third and fourth operand.

Hence, the addresses of the four special purpose register operands are:

OP1.SPR.addr := 00001  if rfe,
OP1.SPR.addr := 00111  if ¬rfe,

OP2.SPR.addr := 00010  if rfe,
OP2.SPR.addr := IR[15 : 11]  if ¬rfe,

OP3.SPR.addr := 00101,

OP4.SPR.addr := 00000.

To implement this, the SPR has one variable read port and five constant read ports for the entries EPC, ESR, FCC, RM, and SR. Similar to the floating point store for the FPR, a multiplexer controlled by the signal rfe indicating an rfe-instruction selects the data output for the first and second operand. If the instruction is a moveS2I instruction which reads the register IEEEf (indicated by the signal readIEEEf), the SPR returns the current content of the bus Ret3.newIEEEf and not the content of the register. Recall that an instruction reading the register IEEEf waits in the pipeline stage of the decode phase in which the SPR returns the result. These instructions wait there until all preceding instructions have retired (see section 4.1.7). As soon as this happens, the SPR must return the correct value of the register IEEEf without restarting the read access. This can be done by returning the bus Ret3.newIEEEf which then contains the correct value of the register (see section 4.7.1). Every instruction I which enters the retire sub-phase Ret2 must read the registers SR and IEEEf from the SPR in order to compute the busses newSR and newIEEEf containing the values of these registers at the time the instruction I enters the sub-phase Ret3 (see section 4.7.1). The SPR already has a constant read port for the register SR, thus only one additional constant read port is needed for reading the register IEEEf. The registers 8 and 9 must be written at the same time to store the 64 bit result of integer multiplications or divisions. Therefore, these registers are treated separately from the registers 0 to 7. The registers 8 and 9 are divided into two RAM blocks, which may be written in parallel. If the register 9 is written with a 32 bit result (by the instruction moveI2S), the data must be on the high part of the input data bus.
Since moveI2S instructions use the ALU which writes its results to both the high and low parts of the result bus, this is already guaranteed (see table 4.3 on page 63). To compute the write signals for the two RAM blocks the signal D.dbl computed by the circuit DestCmp is used. If the result is stored in the SPR this signal indicates an integer multiplication or division. Thus, if D.dbl is active the address bits of the destination address can be ignored. If D.dbl is not active, the bits 0 and 3 of the address are used to decide which RAM block is written. As for the FPR, the write signals are assumed to be computed during the retire sub-phase Ret2.

SPR.write07 := D.SPR.write ∧ ¬D.dbl ∧ ¬D.addr[3],
SPR.write8 := D.SPR.write ∧ (D.dbl ∨ (D.addr[3] ∧ ¬D.addr[0])),
SPR.write9 := D.SPR.write ∧ (D.dbl ∨ (D.addr[3] ∧ D.addr[0])).
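The write-enable computation for the three SPR RAM blocks can be sketched behaviorally. The negations follow from the register numbers: registers 0 to 7 have address bit 3 clear, and LO = 8 and HI = 9 differ only in address bit 0; the function name is illustrative.

```python
def spr_write_signals(write: bool, addr: int, dbl: bool):
    """Write enables for the three SPR RAM blocks (registers 0-7, LO = 8,
    HI = 9).  For dbl results (integer mult/div) LO and HI are written in
    parallel and the destination address is ignored."""
    a3 = bool(addr & 0x8)   # address bit 3
    a0 = bool(addr & 0x1)   # address bit 0
    write07 = write and not dbl and not a3
    write8 = write and (dbl or (a3 and not a0))
    write9 = write and (dbl or (a3 and a0))
    return write07, write8, write9
```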

The IEEEf register is written using an additional write port with constant address. If the retire sub-phase is not full, the value of the input bus Ret3.newIEEEf is the unchanged value of the previous instruction (see section 4.7.1). Thus, this write port can be active in every cycle. Due to the construction of the forwarding circuit, the inputs of both write ports are identical if an instruction writes the register IEEEf explicitly. Hence, the priority of the write ports is irrelevant. The rightmost multiplexers in figure 4.37 can be incorporated into the select circuit of the circuit OpGen analogously to the FPR. Hence, the delay and cost of these multiplexers can be replaced by DOR and CAND + COR, respectively. The delay of the SPR RAM is:

D(SPR-RF) ≤ max{D(RAM(8, 32, 2, 2)),D(RAM(1, 32, 1, 1)) + 2 · DMUX }

+ DMUX + DOR.

The cost and delay of the RAM block for the SPR registers 0 to 7 are approximated by the cost and delay of a RAM block with 2 variable read and write ports. Let cRF be the number of cycles needed for a register file access. Then the cost of the circuit SPR can be approximated by:

C(SPR-RF) ≤ C(RAM(8, 32, 2, 2,cRF )) + 2 · C(RAM(1, 32, 1, 1,cRF ))

+ 5 · CAND + COR + 3 · 32 · CMUX + 2 · 32 · (CAND + COR).

4.7.5 Cost and Delay

The read accesses to the SPR in the retire sub-phase Ret2 are not assumed to be critical for the overall delay of the register file environment, as they use read ports with constant addresses. For the register file accesses in the decode sub-phase D1 no forwarding is necessary (see theorem 4.9). Hence, the delay of the register file environment and the number of cycles cRF needed for a register file access are:

D(RF) ≤ max{D(GPR-RF),D(FPR-RF),D(SPR-RF)},

cRF = ⌈D(RF)/δ⌉.

The cost of the register files is:

C(RF) ≤ C(GPR-RF)+ C(FPR-RF)+ C(SPR-RF).

4.8 Producer Table Environment

The producer tables are similar to their corresponding register files, but have one additional read and one additional write port. The new write port is used during the decode sub-phase D1 to write the tag of the instruction into the producer table. The additional read port is used during the retire sub-phase Ret2 to check that no succeeding instruction has overwritten the entry. All producer tables have a reset signal which sets all valid bits to one. The environment of the RAM blocks itself is basically the same as for the register files and therefore not discussed in detail. The address and write signals for the additional ports are computed in advance by the circuit DestCmp. The priority of the new write access during D1 is higher than the priority of the write access during Ret3. The entries of the producer tables consist of the valid bit and the tag for the corresponding register file entries. The entries for the odd registers of the floating point producer table and the register 9 of the special purpose producer table need an extra bit dbl that indicates whether this register is written by a double precision result. The output OPi.R.dbl for i ∈ {1, 2} and R ∈ {FPR, SPR} returns the value of this bit if one of these registers is addressed by the operand, otherwise 0. The output is needed by the reservation stations to decide whether to use the high or the low part to update an operand. The value of the bit dbl is set to the value of the signal D.R.dbl computed by the circuit DestCmp if one of the registers is written during the decode sub-phase D1. It is set to 0 if the producer table is written during the retire sub-phase Ret3.

4.8.1 Forwarding

The producer tables are accessed in four different contexts (operand-read, updating, checking, and retiring). In the decode sub-phase D1 the producer tables are read to obtain the valid bit and the tag of the operands (operand-read-context). Also in the decode sub-phase D1 the tag of the new instruction is written into the producer table entry of the destination register of the instruction (updating-context). In the retire sub-phase Ret2 the producer table entry of the destination register is read again to check whether a succeeding instruction will write the same register (checking-context). If no such instruction exists, in the retire sub-phase Ret3 the producer table entry of the destination register is set valid in order to flag valid register file content (retiring-context). The following lemmas summarize the dependencies between the accesses.

Lemma 4.11. The result of the read access to the producer table in the checking-context for an instruction I must take all write accesses to the producer table in the updating-context into account that are started before the instruction I writes the producer table in the retiring-context.

Proof. All write accesses to the producer table that are started before the read access in the checking-context starts are taken into account by construction of the RAM. Hence,

let an instruction I1 write the producer table in the updating-context after an older instruction I0 has started the read access in the checking-context and before I0 starts the write access in the retiring-context. Let I1 be the first instruction succeeding I0 that writes the same destination register as I0. If the write in the updating-context by I1 is not forwarded to the read access in the checking-context by I0, the instruction I0 will read its own tag. Thus, I0 will set the valid bit in the retiring-context, even if the content of the register is not valid as it will eventually be overwritten by I1.

Theorem 4.12. The write access to the producer table in the retiring-context does not have to be forwarded to the read access in the checking-context.

Proof. Let I0 be an instruction which writes its result into the register file in the retiring-context and I1 be an instruction which simultaneously reads the producer table in the checking-context. Assume the instructions I0 and I1 write the same register. Then both instructions have written their tags to the producer table entry of this register in the updating-context. As decode and retire are done in order, the instruction I1 has written the producer table entry after the instruction I0 in the updating-context. This update of the instruction I1 must have been forwarded to the read access in the checking-context of instruction I0 (see lemma 4.11). Thus, the instruction I0 cannot have read its own tag in the checking-context. Thus, the instruction I0 will not write the producer table in the retiring-context.

Let cTC denote the number of cycles it takes from the start of the forwarding of a write access in the updating-context to the end of the retire sub-phase Ret2. Then in order to compute the content of the producer table at the time an instruction enters the retire sub-phase Ret2, the write access in the updating-context must be delayed by cTC − 1 cycles (see section 2.6.3). As the content of the register does not depend on the write access in the retiring-context (see theorem 4.12), this write port does not need to be delayed.

Theorem 4.13. The read access in the operand-read-context does not depend on the write access in the retiring-context.

The accesses in the operand-read-context and the retiring-context correspond to the read and write accesses to the register files. Instead of forwarding the write access the data are read out of the ROB. Thus, this theorem can be proven analogously to the theorem 4.9 on page 95.

Lemma 4.14. The read access in the operand-read-context must take exactly those writes in the updating-context into account that are started by preceding instructions.

Proof. An instruction must not depend on a succeeding instruction; therefore the write accesses in the updating-context by succeeding instructions must not be forwarded. Since an instruction may depend on any preceding instruction, the writes in the updating-context of all preceding instructions must be forwarded.

Figure 4.38 details the forwarding for the producer tables. The write access in the retiring-context does not interfere with any other accesses and is therefore directly connected to the RAM blocks of the producer tables.

Figure 4.38: Forwarding of the producer table

Due to the circuit DestCmp the address, data, and write signal D1.D.⋆ for the write access in the updating-context are not known before cycle cDC of D1. At this cycle, the write access in the updating-context is forwarded to the read access in the checking-context (to address Ret2.D.addr) analogous to figure 2.13 on page 24 using the forwarding circuit and the forwarding tree on the left side of figure 4.38. Note that in order to realize this forwarding, the write access has to be delayed by another cTC − 1 cycles and hence enters the RAM blocks in cycle cDC + cTC − 1. The read access in the operand-read-context enters the RAM blocks in cycle 0 of the decode sub-phase D1. It must take all write accesses in the updating-context of preceding instructions into account. Since the write access in the updating-context is not started before cycle cDC + cTC − 1, the RAM blocks do not return the data of the write accesses in the updating-context of the cDC + cTC − 1 preceding instructions. These write accesses must be forwarded using a forwarding tree. The address and write signal of the write access in the updating-context are not known before cycle cDC of D1. Therefore the forwarding tree is delayed by cDC cycles and compares the address of the read access in the operand-read-context in cycle cDC (D1.OP⋆.⋆.addr) with the addresses of the write accesses in the updating-context of the instructions in the cDC + cTC − 1 succeeding stages (i.e., stages cDC + 1 to 2 · cDC + cTC − 1). Note that the delay of the circuit DestCmp is relatively small in comparison to the delay of the RAM access. Therefore, it can be assumed that the forwarding tree needs fewer cycles than the RAM access even if it is delayed by cDC cycles.

4.8.2 Cost and Delay

Let b := lROB +1 be the width of the producer table entries. The delay of the producer tables without forwarding can be estimated as:

D(GPR-PT) ≤ D(RAM(32, b, 3, 2)),

D(FPR-PT) ≤ D(RAM(16, b + 1, 3, 2)) + DMUX + DAND,

D(SPR-PT) ≤ max{D(RAM(8, b, 3, 3)), D(RAM(1, b + 1, 2, 2)) + 2 · DMUX } + DMUX + DAND,

D(PT-RAM) ≤ max{D(GPR-PT), D(FPR-PT), D(SPR-PT)}.

The delay of the read access in the operand-read context is:

D(D1.OP⋆.⋆.{valid, tag}) ≤ D(PT-RAM)+ DMUX .

Let cTC be the number of cycles needed from forwarding the write port in the updating-context to the end of the retire sub-phase Ret2. The producer table access is assumed to be the critical path of the retire sub-phase Ret2, otherwise it can be delayed such that it has the same delay as the critical path. In any case cTC is determined by the delay of the forwarding circuit for 5 address bits and the delay of the circuit TagCheck:

cTC = ⌈(D(Forward(5)) + D(TagCheck))/δ⌉.

If cTC = 1, the write in the updating-context does not have to be delayed and no additional forwarding tree is needed for the read access in the checking-context. In this case the forwarding increases the delay of the read access in the checking-context only by the mux inside the forwarding circuit needed for merging the outputs of the forwarding circuit and the RAM. Otherwise, it increases the delay by two muxes: one inside the forwarding tree for merging the output of the RAM with the output of the forwarding tree, and one for merging with the output of the forwarding circuit (see figure 4.38). Thus, the delay of the producer table access in the checking-context is:

D(Ret2.PT.⋆.tag) ≤ D(PT-RAM) + DMUX if cTC = 1, and D(Ret2.PT.⋆.tag) ≤ D(PT-RAM) + 2 · DMUX if cTC > 1.

The maximum delay of the read accesses to the producer table and the number of cycles cPT needed are:

D(PT) ≤ max{D(D1.OP⋆.⋆.{valid, tag}),D(Ret2.PT. ⋆ .tag)},

cPT = ⌈D(PT)/δ⌉.

The costs of the producer tables without forwarding are:

C(GPR-PT) ≤ C(RAM(32, b, 3, 2, cPT )) + CAND,

C(FPR-PT) ≤ C(RAM(16, b, 3, 2,cPT )) + C(RAM(16, b + 1, 3, 2,cPT ))

+ 6 · CAND + COR + 6 · CMUX

+ 2 · b · CMUX + 2 · 2 · b · CAND,

C(SPR-PT) ≤ C(RAM(8, b, 3, 3, cPT ))

+ C(RAM(1, b, 2, 2, cPT )) + C(RAM(1, b + 1, 2, 2, cPT ))

+ 12 · CAND + 2 · COR + 3 · b · CMUX + 4 · b · CAND,

C(PT-RAM) ≤ C(GPR-PT) + C(FPR-PT) + C(SPR-PT).

The number of cycles needed for the computation of the destination registers cDC is:

cDC = ⌈D(DestComp)/δ⌉.
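The translation of gate delays into cycle counts used throughout this chapter is a simple ceiling division by the stage time δ. The following sketch makes this explicit; the numeric delays are hypothetical placeholders, not values computed in this thesis:

```python
import math

def cycles(delay, delta):
    """Number of pipeline cycles needed for a circuit of the given gate
    delay when each stage may contain at most `delta` gate delays."""
    return math.ceil(delay / delta)

# Hypothetical example: D(Forward(5)) + D(TagCheck) = 11 and delta = 9
# would yield c_TC = 2 cycles.
c_TC = cycles(11, 9)
```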

The forwarding circuit for the read access in the checking-context consists of a forward circuit and a forwarding tree with cTC inputs for each PT. The forwarding circuit for the read access in the operand-read-context consists of a forwarding tree with cTC + cDC − 1 leaves for each operand output of the PTs (note that the high parts of the FPR and the SPR use only 4 address bits). The total cost for the forwarding circuits of the PT is:

C(PT-Forward) ≤ 3 · C(Forward(5,b,cPT ,cF (5)))

+ 3 · C(Forward-Tree(5,b,cTC ,cPT ))

+ 4 · C(Forward-Tree(5,b,cDC + cTC − 1,cPT ))

+ 6 · C(Forward-Tree(4,b,cDC + cTC − 1,cPT ))

+ (cDC + cTC ) · 38 · CREG.

The total cost for the producer tables is:

C(PT) ≤ C(PT-RAM) + C(PT-Forward).

Chapter 5

Memory Unit

This section describes the memory unit of the DLXπ+. An overview of the memory unit is given in section 5.1. Sections 5.2 to 5.8 describe the non-blocking data cache used in the memory unit.

5.1 Overview

The memory unit handles all load and store accesses to the main memory. The unit presented in this thesis does not support virtual memory, i.e., the addresses sent by the processor can be used directly to address the main memory. Page fault interrupts are not computed by the memory unit but are assumed to be computed by the main memory. Furthermore, the memory unit is not assumed to be able to write to the instruction memory. Thus, no cache coherency protocol is needed between the instruction and the data cache, and lines in the cache do not have to be invalidated. The memory unit is divided into the three circuits Sh4S, DCache, and Sh4L (see figure 5.1). The data cache which performs the actual memory access is contained in the circuit DCache. All accesses to the data cache must be aligned to word-addresses. The adaption for instructions which do not access whole words is done by the circuits Sh4S and Sh4L. In contrast to the memory unit of the DLX by Müller and Paul [MP00], double word accesses do not have to be supported, since they are not part of the MIPS R3000 ISA. The memory unit gets as input the control bus con that defines the type of the access (including the signal write indicating a store and an immediate constant imm), the tag of the instruction, and the two operands OP1 and OP2. The circuit Sh4S first

Sh4S

DCache

Sh4L

Figure 5.1: Memory Unit

computes the effective address addr of the access as the sum of the first operand and the immediate constant. If the memory access is misaligned, the circuit Sh4S raises the data access misaligned interrupt Dmal. In this case the instruction is sent directly to the circuit Sh4L. Thus, the data cache does not need to handle misaligned accesses. Based on the lower bits of the effective address, the second operand is shifted to the correct position for a word-wise access, resulting in the data bus data. In parallel the circuit Sh4S computes for every byte i ∈ {0,..., 3} of the data word a usage bit ubi which marks the bytes used by the access. Hence, for loads the bus ub⋆ selects the bytes to be read, for stores ub⋆ selects the bytes that are written. The data cache uses the write bit, the effective address, the data bus, and the usage bits to perform the memory access. For load accesses the data cache returns the result on the data bus. Along with the result, the data cache returns the effective address of the access, the page fault interrupt signal Dpf received from the main memory, and the control signals which are passed unchanged by the cache, e.g., the tag of the instruction. The memory unit is able to handle load word left (LWL) and load word right (LWR) instructions (see table A.1 in the appendix). These instructions update only parts of the target register and are used to access misaligned words. Since the processor core always updates whole register entries, the destination register is also used as second operand. The result of the load access is combined with the content of the destination register to compute the result. Thus, for LWL/LWR instructions, the data cache must return the content of the main memory for the bytes for which ub⋆ is active, and the content of the second operand for the bytes for which ub⋆ is inactive.
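The address and usage-bit computation of Sh4S can be sketched as follows (a simplified model: the shifting of the store operand is omitted and the byte numbering within the word is illustrative, ignoring the endianness of the MIPS R3000):

```python
def sh4s(op1, imm, width):
    """Simplified model of Sh4S: compute the word-aligned effective
    address, the usage bits ub for an access of `width` bytes
    (1, 2, or 4), and the misalignment interrupt Dmal."""
    addr = (op1 + imm) & 0xFFFFFFFF      # effective address: op1 + imm
    offset = addr & 0x3                  # byte offset within the word
    dmal = (addr % width) != 0           # data access misaligned
    word_addr = addr & ~0x3              # word-aligned address for the cache
    # ub_i marks the bytes of the word used by the access
    ub = [offset <= i < offset + width for i in range(4)]
    return word_addr, ub, dmal
```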
The circuit Sh4L shifts the result of the cache access as requested by the instruction, based on the lower bits of the effective address and the value of the control signals in the bus con. The result of load accesses is returned to the processor on the high and the low part of the CDB (see table 4.3 on page 63). If an interrupt occurred, the memory unit returns the effective address on the low part of the CDB. The construction and the pipelining of the circuits Sh4S and Sh4L are straightforward and not discussed in detail. The circuits can be found in appendix D.4. Delay and cost of the circuits are:

D(Sh4S) ≤ max{D(Add(32)), D(Add(2)) + D(HDec(2)) + 2 · DMUX },

D(Sh4L) ≤ 4 · DMUX ,

C(Sh4S) ≤ C(Add(32)) + 2 · C(Dec(2)) + C(HDec(2)) + 104 · CMUX + 11 · COR + 10 · CAND,

C(Sh4L) ≤ C(Inc(2)) + C(Dec(2)) + C(Sel(4)) + 186 · CMUX + 3 · COR + CAND.

5.2 Overview of the Data Cache

The data cache presented in this thesis is a “non-blocking write-through write-allocate” cache. In contrast to the simpler blocking caches, a non-blocking cache does not stall the memory unit in case of a cache miss. It can service multiple misses at a time and return the result of a hit before a preceding miss has completed. A write-through cache updates the main memory (or the next cache level) for every write access to the cache. Thus, the main memory always contains the same data as the cache. In contrast, the write-back variant only updates the memory if a line that has been written is evicted from the cache. A new cache-line is written into the cache whenever an access misses the cache. This is called write-allocate. A read-allocate cache would only write new lines into the cache for read-misses. A cache-line contains SDC = 2^sDC (aligned) bytes. On a miss, a whole cache-line is always read from the main memory and saved in the cache. The data cache is assumed to be non-sectored, i.e., the width of the cache RAM equals the width of a cache-line. Accesses to the cache RAM always write or read whole cache-lines. The data cache is a KDC-way set associative cache (KDC = 2^kDC). Every line of the main memory can be saved at KDC different locations in the cache. The location of a cache-line currently saved in the cache is defined by the way of the cache-line. The basics of non-blocking caches were presented by Kroft [Kro81]. The presented design of a non-blocking cache is based on the work of Sicolo [Sic92]. In his thesis Sicolo gives an overview of the basic structures of a non-blocking cache, but does not handle the gate-level implementation, pipelining, and interrupts. To solve these problems the design of Sicolo had to be adapted significantly. The cache design presented by Sicolo combines two succeeding stores to the same cache-line into a single store.
This is not possible for a first level cache as writes have to be executed in order due to interrupts. Also, Sicolo uses a write-back strategy, which has the following drawback not handled in his work: before a new cache-line is written into the cache it must be checked whether any succeeding access will replace this line. If this is the case and the cache-line has been modified by a store, the cache-line must not be written into the cache but back into the main memory. Note that for this check the way in which the succeeding accesses write is needed. Hence, only those accesses can be taken into account for which the way has already been computed. To cover the remaining accesses, every new access must additionally be checked as to whether it will overwrite a cache-line which is about to be written into the cache. This can only be done after the hit signal for the instruction has been computed and therefore delays misses.
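The decomposition of a byte address for such a set-associative cache can be sketched as follows (the parameter l_dc for the number of set-index bits and the function name are assumptions made for this illustration, not notation from this thesis):

```python
def split_address(addr, s_dc, l_dc):
    """Illustrative decomposition of a 32-bit byte address for a
    set-associative cache with 2**s_dc bytes per line and 2**l_dc sets."""
    line_addr = addr >> s_dc                  # bits 31..s_dc: the line-address
    set_index = line_addr & ((1 << l_dc) - 1) # selects the set in the cache
    tag = line_addr >> l_dc                   # stored in the cache directory
    return line_addr, set_index, tag
```

Two accesses hit the same cache-line exactly if their line-addresses are equal.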

Figure 5.2 depicts an overview of the data cache. The cache consists of four sub-circuits: the hit computation HC, the update queue UpdQ, the read queue ReadQ, and the cache core Core containing the actual cache RAM. The update queue combines the miss queue and the replace queue of the design presented in [Sic92]. The cache core contains the cache memory, the cache directory, and the replacement circuit. The update queue holds all accesses that will eventually update the cache core (i.e., store instructions and cache misses). The read queue contains all read misses. The circuit HC computes the overall hit signal, taking the content of the queues and the cache core into account.

5.2.1 Execution of Memory Accesses

Depending on the type of the memory access, different data are required for the execution of the access. Loads only require the data that are actually read. Stores require the


Figure 5.2: Overview of the data cache

whole cache-line in order to update the cache core. These are called the required data of the access. New memory accesses first enter the hit computation. The hit computation computes a hit signal indicating whether the required data of the access are in the cache. It may happen that only parts of the data needed by an access are in the cache, e.g., if a load follows a store to the same address. In this case the access is treated as a miss. Apart from the hit signal, the hit computation also returns all required bytes of which the value is known. These bytes are marked valid using an additional valid bit per byte. Based on the result of the hit computation and the type of the access, instructions are sent to different queues. Stores are sent to the update queue. Load misses are sent to both the update queue and the read queue. Load hits are directly returned to the memory unit. Hence, the data cache can return instructions out-of-order. The misses in the update queue start read requests to the main memory. The main memory returns the result of the access on the result bus. The content of the result bus is used to update the entries in the update queue as well as the read queue. As soon as all bytes of a load in the read queue are valid, the result is returned to the memory unit. If all the bytes of a cache-line in the update queue are valid, the cache core can be updated. Since the cache is a write-through cache, no cache-line must be evicted to the main memory before it is overwritten. If the entry in the update queue is a store, the main memory is also updated and the store is returned to the memory unit.
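The routing decision described above can be summarized by the following sketch (simplified: the req and rdy bits discussed in section 5.3.5 are omitted, and the function name is invented for illustration):

```python
def dispatch(is_store, hit):
    """Sketch of the routing of an access after the hit computation.
    Returns the set of destinations for the access."""
    if is_store:
        return {"update_queue"}            # stores always update the core
    if hit:
        return {"memory_unit"}             # load hit: result returned directly
    return {"update_queue", "read_queue"}  # load miss: wait for the line
```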

5.2.2 Cache Core and Main Memory

The handling of misses is done outside the cache core macro. Therefore, the data path of the cache core can be the same as for a simple blocking cache. Additionally, the cache core does not handle any data dependencies. This makes it easy to pipeline the core. The only problem may be the replacement circuit, which decides which cache-line has to be overwritten. Advanced replacement algorithms (e.g., LRU) take all preceding instructions into account and assume that the instructions in the cache are handled in order. Yet a strict adherence of the replacement circuit to the strategy is not needed for data-consistency. Therefore, in this thesis a pipelined LRU algorithm that does not forward the preceding instructions is used, which simplifies the design of the cache core. The main memory is not assumed to return the read accesses in order. This makes it possible to build multiple levels of non-blocking caches using the presented design. To identify the results returned from the main memory, the main memory must return the address of the request along with the data.

5.2.3 Speculation

If an instruction causes an interrupt or has been mispredicted, all succeeding instructions must be aborted. Updates of the cache core or the main memory are not recoverable. Thus, a store instruction may not update the cache core or the main memory as long as the instruction may be aborted. An instruction won’t be aborted if the instruction does not cause a repeat interrupt (i.e., a page fault) and all preceding instructions have retired, i.e., the instruction is the oldest active instruction. The page fault interrupt is computed by the main memory. It can only be caused by accesses to cache-lines that are not in the cache. Thus the value of the page fault interrupt is known before a store instruction updates the cache or the main memory, since the instruction first tries to fetch the cache-line from the main memory in case of a miss. If this read access causes a page fault, the store instruction is not executed. To guarantee that store instructions won’t update the cache core or the main memory before they become the oldest active instruction, two solutions are possible. The simple solution is to stall a store instruction at the end of the hit computation until it becomes the oldest active instruction. Due to the special memory reservation station (see section 4.2.2) a store instruction is always the oldest memory instruction when it enters the memory unit. Thus, no preceding instruction gets stalled and the store instruction eventually becomes the oldest active instruction. This simple variant has some drawbacks: load instructions may not overtake store instructions in the reservation station. Thus, a store instruction stalls all succeeding load instructions. Furthermore, before a store instruction can leave the hit computation, all queues of the cache have to be emptied. This means in particular that at most one store instruction can be in the update queue.
A more efficient solution is to stall store instructions in the update queue just before they update the cache RAM or the main memory. The stores are stalled until they become the oldest instruction and it is clear that they did not cause an interrupt. This method is used in the presented design and is described in more detail in the update queue section 5.5.

5.3 Hit Computation

The hit computation computes the two signals hit and sl. The signal hit (called hit signal) is active if the required data is already in the cache. The signal sl (called same-line signal) is active if a preceding instruction inside the cache accesses the same line. In this case, even if the requested data is not yet in the cache, no new request to the main memory has to be made. The signals hit and sl may be active at the same time, e.g., if the cache-line containing the required data has been requested by a preceding instruction and has just been returned by the main memory. The hit signal has a higher priority. If a memory access succeeds a store that writes some of the requested data of the access, it may happen that only parts of the requested data are known at the time the hit computation returns the data for the access. Therefore, the circuit HC computes a byte-valid signal bvi for every byte i of the requested data. The signal bvi is active if the value of the byte i is known. The hit signal hit is active if the byte-valid signals for all required bytes of the access are active. The signals hit, sl, and bv⋆ together are called global hit signals. For every byte i for which bvi is active, the hit computation returns the value written by the last store being processed by the cache which updates this byte. If no such instruction exists, the hit computation returns the current content of the cache RAM for this byte. In order to later update the cache core, the hit computation also computes the bus way which points to the way of the cache core which has to be written. If the cache currently processes an access that addresses the same cache-line, the way must be the way computed for this access. Otherwise the value computed by the replacement circuit of the cache core is used. Note that in case of a hit the replacement circuit returns the index of the way that contains the requested data.
Based on the type of the access and the value of the global hit signals, the hit computation computes the action to be taken for the instruction. The instruction can be sent to the update queue, the read queue, both queues, or can directly be returned to the memory unit. Note that an access is treated as a hit whenever the requested data can be found in the cache, even if it would be a miss if the accesses were handled sequentially. Assume the cache core contains a cache-line l0 and an access to the cache-line l1 will replace the cache-line l0. If a second access to the cache-line l0 is started before the line is replaced, this access is treated as a hit as the cache core returns the correct data. If the access to the cache-line l0 is a store, it will update the cache core and thereby overwrite the cache-line l1 written into the cache core by the preceding access. Since the cache is a write-through cache this can be done without checking whether data would be lost.

5.3.1 Overview of the Hit Signal Computation

The requested data of an instruction can be located in four different places in the cache: in the cache core, in the update queue, on the result bus of the main memory, or as write data of a preceding store in the hit computation itself. If the requested data cannot be found in any of these places, the data must be loaded from the main memory. The update queue and the hit computation compute for every entry respectively pipeline stage a local same-line signal and local byte-valid signals. The local same-line signal is active if the entry or stage contains an instruction which accesses the same line as the instruction for which the global hit signals are being computed. Let SDC = 2^sDC be the number of bytes of a cache-line. Then two instructions access the same line if the bits 31 to sDC of the addresses (called line-address) are equal. The local byte-valid signals indicate that the entry or stage contains the correct value for that byte. They may only be active if the local same-line signal is active. For the cache core and the result bus of the main memory, all bytes are valid on a hit. Therefore, these circuits only compute a local hit signal which is active if the requested line is found. The global hit signals must reflect the content of the cache at the time the queues are updated. The computation of the global hit signals may take multiple cycles due to pipelining. Therefore, all changes to the content of the cache during the computation must be forwarded in order to be reflected in the result. The cache core is only updated by the update queue. Thus, in order to forward all possible changes of the cache core it suffices to forward the content of the update queue at the time the hit computation is started and all updates to the update queue. The entries of the update queue are updated by the result bus of the main memory and by the hit computation for allocating new entries. Note that stores are processed in order.
Thus, only the instructions in the hit computation which are started before the instruction for which the global hit signals are being computed have to be taken into account. Hence, all possible updates to the update queue by the hit computation are known at the time the hit computation for an access is started. The updates to the update queue by the main memory are not known at the time the hit computation is started. They must be forwarded in every cycle of the hit computation. Figure 5.3 shows an overview of the computation of the global hit signals. For the computation of the global hit signals, the hit computation first computes static and dynamic hit signals. The static hit signals are based on the content of the hit computation, the update queue, and the cache core. They must represent the value of these circuits at the time an instruction enters the hit computation and have to be computed only once. The dynamic hit signals are based on the content of the result bus of the main memory and have to be computed in every cycle. Since all bytes of the result bus are valid, the dynamic hit signals only consist of a hit signal (the dynamic hit signal) without byte-valid signals. The dynamic hit signal must be set if the line-addresses of the current data on the result bus and the instruction for which the hit signals are being computed are equal. Assume the cache is a RAM block with 32 − sDC address bits that is written by the result bus of the main memory. If the computation of the global hit signal is seen as read to that RAM block the dynamic hit signal must be set exactly if the write port would be forwarded to the read port of the RAM block. Thus, the computation of the dynamic hit signals can be done using a forwarding circuit (with stalling) of such a RAM with 32 − sDC address bits.
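The dynamic hit signal can be modelled as a single-stage match against the un-delayed result bus (an illustrative sketch; the tuple layout of the result bus is an assumption made for this example):

```python
def dynamic_hit(inst_line_addr, result_bus):
    """Sketch of the dynamic hit signal: the un-delayed result bus of the
    main memory is matched against the line-address of the instruction,
    like forwarding the write port of a RAM with 32 - s_DC address bits.

    result_bus: (full, line_addr, data) or None if the bus is idle.
    """
    if result_bus is None:
        return False, None
    full, line_addr, data = result_bus
    if full and line_addr == inst_line_addr:
        return True, data   # all bytes of the result bus are valid
    return False, None
```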

Let cM2H denote the number of cycles needed for the computation of the global hit signals based on the content of the result bus of the main memory. In order to take all data from the result bus into account that have updated the update queue at the time the global hit signals are computed, the result bus is delayed by cM2H − 1 cycles, similar to the pipelined forwarding circuits (see section 2.6.3). The forwarding circuit then uses the un-delayed result bus, and the cM2H − 1 additional stages of the result bus are forwarded using a forwarding tree. This tree is included in the computation of the static hit signals. The circuit staticHC in figure 5.3 computes the static hit signals ssl (static same-line) and sbv⋆ (static byte-valid) along with the corresponding data busses sbyte⋆ and

Figure 5.3: Overview of the hit signal computation

the way sway. For this computation, the cache core and the update queue are accessed to compute local hit signals. In addition to these local hit signals, the circuit staticHC uses the stages 1 to cM2H − 1 of the delayed result bus and the preceding instructions in the hit computation to compute the static hit signals. The dynamic hit signal dhit and the corresponding data busses dbyte⋆ are computed using a forwarding circuit with stalling (ForwardStall). This circuit uses the un-delayed result bus. The outputs of the two circuits staticHC and ForwardStall are combined in the circuit globalHC to compute the global hit signals, byte values, and the way.

5.3.2 Local Hit Signals

For the computation of the static hit signals, first local same-line and byte-valid signals must be computed for all stages of the hit computation and all update queue entries. These signals are called HC.lsli (local same-line) and HC.lbv⋆,i (local byte-valid) for stage i of the hit computation and UpdQ.lsli respectively UpdQ.lbv⋆,i for the entry i of the update queue. For the stages of the delayed result bus of the main memory and the cache core a local hit signal must be computed. These signals are called MM.lhiti (local hit) for the stage i of the delayed result bus, and Core.lhit for the cache core. The local same-line signal indicates that the stage or entry holds an access which addresses the same cache-line as the access for which the hit signals are being computed. Which bytes of this cache-line are valid is defined by the local byte-valid signals. The local hit signals for the result bus and cache core are active if the required data is in the delayed result bus stage respectively cache core. The computation of the local signals for cache core and update queue is described in sections 5.4 and 5.5. To compute the local hit signals for the hit computation stages, the bus ub⋆ indicating which bytes of a word are used must be extended to a whole line. Figure 5.4

Figure 5.4: Computation of the signals eub⋆ and byte⋆

Figure 5.5: Computation of the local hit signals for a delayed result bus stage i and a hit computation stage j

depicts the computation of the extended usage-bits eub⋆. The byte 4 · m + n for m ∈ {0,...,SDC/4 − 1} and n ∈ {0,..., 3} of a line is used if the instruction accesses the word m of the line and the byte n of the word is used. Which word of a cache-line is accessed can be computed by decoding the bits sDC − 1 to 2 of the address. Additionally, the circuit in figure 5.4 computes the byte busses byteSDC−1...0 by copying the data bus SDC/4 times. The computation of the signals eub⋆ and byte⋆ must only be done once for every instruction which enters the hit computation. The circuit in figure 5.5 computes the local hit signals from the accesses in the hit computation and the stages of the delayed result bus. Let cHC be the number of cycles of the hit computation. Let I be an instruction which enters the hit computation, i.e., I is in stage 0. An instruction in stage j ∈ {1,...,cHC − 1} of the hit computation accesses the same line as I if the line address HCj.addr of the instruction in stage j is equal to the line address HC0.addr of the instruction I in stage 0. If this is the case and the stage j contains a valid access (HCj.full = 1), the signal HC.lslj is activated. If HC.lslj is active, the local byte-valid signals are activated for all bytes written by the instruction in stage j (indicated by HCj.eub⋆ ∧ HCj.write). Note that this disables all byte-valid signals if the access in stage j is a load. The local hit signal for stage i ∈ {1,...,cM2H −1} of the delayed result bus can be computed by comparing the line address of the delayed result bus stage MMi.addr with the address of I and checking the full bit MMi.full. Since all bytes of the result bus are valid, no byte-valid signals are needed.
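The computations of figures 5.4 and 5.5 can be modelled in software as follows (a behavioural sketch only; s_dc denotes the number of byte-offset bits of a line as above, and the function names are invented for this illustration):

```python
def extended_usage_bits(ub, addr, s_dc):
    """Extend the per-byte usage bits of one word to a whole cache-line
    of 2**s_dc bytes (figure 5.4): byte 4*m + n is used iff the access
    addresses word m of the line and byte n of the word is used."""
    word = (addr >> 2) & ((1 << (s_dc - 2)) - 1)   # bits s_dc-1 .. 2
    eub = [False] * (1 << s_dc)
    for n in range(4):
        eub[4 * word + n] = ub[n]
    return eub

def local_hit(full, stage_addr, inst_addr, s_dc, eub, is_write):
    """Local same-line and byte-valid signals for one hit computation
    stage (figure 5.5). Loads disable all byte-valid signals."""
    lsl = full and (stage_addr >> s_dc) == (inst_addr >> s_dc)
    lbv = [lsl and e and is_write for e in eub]
    return lsl, lbv
```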

5.3.3 Static Hit Signals

The static byte-valid signal sbvk for a byte k must be active if any local byte-valid signal for the byte k of a hit computation stage or an update queue entry is active, or the local hit signal for a delayed result bus stage or the cache core is active. If the

Figure 5.6: Computation of the signals sbvj and sbytej for a byte j

signal sbvk is active, the bus sbytek for the corresponding data must contain the value of the last instruction preceding I which updates that byte. If no such instruction exists, the bus sbytek must contain the current content of the main memory. Figure 5.6 shows the computation of the signals sbvk and sbytek for a byte k. The computation of the byte-valid signals and the corresponding bytes can be done using a tree as in the forwarding tree (see section 2.6.3). The tree computes an OR of the local byte-valid signals (respectively hit signals for cache core and result bus). In parallel, the tree selects the first byte (from the left) for which the byte-valid signal is active. The outputs of the cache core Core.lhit and Core.byte⋆ are assumed to be timing critical as they depend on an access to the cache RAMs. Therefore, they are not used before the last stage of the tree. The outputs of the tree which do not take the signals from the cache core into account are called qbv⋆ respectively qbyte⋆. Thus, for a byte k with 0 ≤ k ≤ SDC − 1 holds:

sbvk = qbvk ∨ Core.hit, (5.1)

sbytek = qbytek if qbvk = 1, Core.bytek if qbvk = 0. (5.2)

The instructions in the hit computation have been started after the instruction in the update queue and must have a higher priority in the tree. Since instructions enter hit computation and update queue in order and are not reordered within, the stages respectively entries with lower index must have higher priority. The result bus of the main memory and the cache core only contain data written by already completed instructions and therefore have a lower priority than hit computation and update queue. Since cache and main memory contain the same data due to the write-through strategy, the order is irrelevant for the result bus stages and the cache core.
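The priority selection performed by the tree can be modelled as follows (a purely behavioural sketch that ignores the tree structure and timing; in the real circuit the cache core inputs enter only in the last tree stage):

```python
def static_byte(sources, core_hit, core_byte):
    """Priority selection behind figure 5.6 for one byte k: the first
    source (highest priority = youngest preceding instruction) with an
    active byte-valid signal wins; the cache core is consulted last.

    sources: list of (byte_valid, byte) pairs in priority order
    (hit computation stages first, then update queue entries, then
    result bus stages)."""
    for bv, byte in sources:
        if bv:
            return True, byte            # corresponds to qbv_k / qbyte_k
    return core_hit, core_byte           # fall back to the cache core
```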

The static same-line signal ssl must be active if the local same-line signal is active for any hit computation stage or update queue entry. The static way must be the way of the last instruction for which the local same-line signal is active. If none of the same-line signals is active, the way must be set to the way output of the core Core.way. The circuit for the computation of the signals ssl and sway is shown in figure 5.7.

Figure 5.7: Computation of the signals ssl and sway

Among other things, the static way sway of the instruction in stage 0 of the hit computation depends on the way of the instruction in stage 1, which is computed only one cycle before sway. Thus, the delay of the path from HC1.way to sway must be at most one cycle. In order to minimize the delay on this path, the static way is not computed using a forwarding tree like the byte-valid signals and the data. Instead, the first instruction for which the local same-line signal is active is determined using a find-first-one circuit. If none of the local same-line signals is active, the zero output z of the find-first-one circuit is active. Thus, the static same-line signal must be active if the zero output is inactive. Using the output ffo which unary selects the first instruction with active same-line signal, the way of this instruction is selected. If the zero output is active, the way of the core is selected. In order to minimize the delay for the signals HC1.way and Core.way, the select circuit is “unbalanced” accordingly.
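The find-first-one based selection can be modelled as follows (a behavioural sketch; the sequential loop stands for the parallel FFO circuit):

```python
def select_way(local_sl, ways, core_way):
    """Sketch of the ssl/sway computation (figure 5.7): a find-first-one
    over the local same-line signals selects the way of the first matching
    instruction; if none matches (zero output z active), the way of the
    replacement circuit in the cache core is used."""
    for sl, way in zip(local_sl, ways):
        if sl:                  # find-first-one: first active same-line signal
            return True, way    # ssl = 1, sway = way of that instruction
    return False, core_way      # z active: ssl = 0, sway = Core.way
```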

5.3.4 Global Hit Signals

The global byte-valid signal bv_k for a byte k has to be active if either the static byte-valid signal sbv_k for this byte or the dynamic hit signal dhit is active. Since only the processor writes the main memory, the dynamic data dbyte⋆ from the main memory cannot be newer than the static data sbyte⋆ in the cache. Thus, if the static byte-valid signal is active, the static data have to be returned:

bv_k = sbv_k ∨ dhit, (5.3)

byte_k = sbyte_k  if sbv_k = 1,
         dbyte_k  if sbv_k = 0.    (5.4)
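As an illustration, the per-byte selection of equations (5.3) and (5.4) can be sketched in Python. The function name global_byte is chosen for this illustration only and does not appear in the hardware description:

```python
def global_byte(sbv_k, sbyte_k, dhit, dbyte_k):
    """Combine static (cache) and dynamic (main memory) results for one byte.

    Implements bv_k = sbv_k OR dhit (5.3) and the selection of (5.4):
    the static data wins whenever its byte-valid bit is set, because the
    main memory can never hold newer data than the write-through cache.
    """
    bv_k = sbv_k | dhit
    byte_k = sbyte_k if sbv_k else dbyte_k
    return bv_k, byte_k

# static byte valid: static data is returned even if the dynamic hit is active
assert global_byte(1, 0xAA, 1, 0x55) == (1, 0xAA)
# static byte invalid: the byte is only valid if the dynamic hit is active
assert global_byte(0, 0xAA, 1, 0x55) == (1, 0x55)
assert global_byte(0, 0xAA, 0, 0x55) == (0, 0x55)
```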

The global hit signal hit must be active if all bytes required by the instruction are valid. For write instructions all bytes of the cache line are required in order to update the cache core. Yet the bytes of the line written by the instruction (indicated by HC^0.eub⋆) are supplied by the instruction itself and need not be found in the cache. For read instructions the bytes to be read (indicated by HC^0.eub⋆) need to be valid:

hit = ⋀_{k=0}^{S_DC−1} ( bv_k ∨ ¬(HC^0.eub_k ⊕ HC^0.write) ).

Note that the signals HC^0.eub⋆ and HC^0.write are known early in the hit computation and are not as timing critical as the dynamic hit signal and the hit signal from the cache core. Therefore, the computation of the hit signal is optimized using the distributive law:

hit = dhit ∨ ⋀_{k=0}^{S_DC−1} ( sbv_k ∨ ¬(HC^0.eub_k ⊕ HC^0.write) )            (by (5.3))
    = dhit ∨ ⋀_{k=0}^{S_DC−1} ( Core.hit ∨ qbv_k ∨ ¬(HC^0.eub_k ⊕ HC^0.write) ) (by (5.1))
    = dhit ∨ Core.hit ∨ ⋀_{k=0}^{S_DC−1} ( qbv_k ∨ ¬(HC^0.eub_k ⊕ HC^0.write) ).

No dynamic way and same-line signals are computed for the result bus of the main memory. The static way and same-line signal are therefore used as global way and same-line signal:

sl = ssl,    way = sway.
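The effect of the distributive law can be checked with a small Python sketch. It assumes, as used in the derivation via equation (5.1), that sbv_k = Core.hit ∨ qbv_k; the early, non-critical per-byte condition ¬(eub_k ⊕ write) is abstracted into an input x_k:

```python
from itertools import product

def hit_unoptimized(dhit, core_hit, qbv, x):
    # hit = dhit OR AND-tree over (sbv_k OR x_k), with sbv_k = Core.hit OR qbv_k
    return dhit | all(q | core_hit | xk for q, xk in zip(qbv, x))

def hit_optimized(dhit, core_hit, qbv, x):
    # distributive law: the late Core.hit signal is pulled out of the
    # AND-tree and passes through a single OR gate only
    return dhit | core_hit | all(q | xk for q, xk in zip(qbv, x))

# exhaustive equivalence check for a 4-byte line
for dhit, core_hit in product((0, 1), repeat=2):
    for qbv in product((0, 1), repeat=4):
        for x in product((0, 1), repeat=4):
            assert hit_unoptimized(dhit, core_hit, qbv, x) == \
                   hit_optimized(dhit, core_hit, qbv, x)
```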

5.3.5 Actions

Based on the global hit signals, the hit computation either directly returns the result to the memory unit or allocates new entries in the read and the update queue. If a new update queue entry is allocated, the bits req and rdy for this entry also have to be computed. The request bit req has to be active if the entry must request the line from the main memory. The ready bit rdy indicates that the entry already contains the correct data. Based on the signals hit and miss and the type of the access (indicated by the signal write, which is active for stores), the following actions have to be taken:

• ¬write ∧ hit: The instruction is a load-hit. The result of the load can be returned from the hit computation directly to the memory unit.

• ¬write ∧ ¬hit ∧ sl: The instruction is a load-miss but a preceding instruction accesses the same line. A new entry in the read queue has to be made. Since the required data will be requested by a preceding instruction, no new update queue entry is needed.

• ¬write ∧ ¬hit ∧ ¬sl: The instruction is a load-miss and no preceding instruction accesses the same line. New entries in the read queue and the update queue have to be made. The request bit req of the new entry is set to one; the ready bit rdy of the entry is set to zero.

• write ∧ hit: The instruction is a store-hit. A new entry in the update queue has to be made. The bit rdy is set to one and the bit req is set to zero.

• write ∧ ¬hit ∧ sl: The instruction is a store-miss but a preceding instruction accesses the same line. A new entry in the update queue has to be made. Since no new read request for the cache-line has to be made, the bits rdy and req are set to zero.

Figure 5.8: Computation of the data bus to the read queue and the memory unit

• write ∧ ¬hit ∧ ¬sl: The instruction is a store-miss and no preceding instruction accesses the same line. A new entry in the update queue has to be made, which requests the required cache-line from the main memory. The bit rdy is set to zero and the bit req is set to one.

The result of a load access is returned to the memory unit using the bus Sh4L.⋆. New entries for the update queue and the read queue are allocated using the busses UpdQ.⋆ and ReadQ.⋆, respectively. The full bit of a bus is active if the corresponding action has to be taken. Based on the preceding list, the full bits to the circuit Sh4L and to the two queues as well as the bits req and rdy for the bus to the update queue can be computed as follows:

Sh4L.full = ¬write ∧ hit,
ReadQ.full = ¬write ∧ ¬hit,
UpdQ.full = write ∨ (¬hit ∧ ¬sl),    (5.5)
UpdQ.rdy = hit,
UpdQ.req = ¬hit ∧ ¬sl.
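The case distinction of the preceding list and of equation (5.5) can be summarized in a small Python sketch (function and variable names are chosen for this illustration only):

```python
def actions(write, hit, sl):
    """Full bits of equation (5.5) plus req/rdy for a new update queue entry."""
    sh4l_full = (not write) and hit          # load-hit: return result directly
    readq_full = (not write) and (not hit)   # any load-miss enters the read queue
    updq_full = write or ((not hit) and (not sl))
    updq_rdy = hit                           # a store-hit already has all data
    updq_req = (not hit) and (not sl)        # request the line unless a preceding
                                             # instruction fetches the same line
    return sh4l_full, readq_full, updq_full, updq_rdy, updq_req

# load-hit / load-miss with and without a same-line predecessor
assert actions(False, True, False) == (True, False, False, True, False)
assert actions(False, False, True) == (False, True, False, False, False)
assert actions(False, False, False) == (False, True, True, False, True)
# store-hit and store-miss
assert actions(True, True, False) == (False, False, True, True, False)
assert actions(True, False, True) == (False, False, True, False, False)
```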

The byte-valid signals and the corresponding data sent to the update queue are called UpdQ.Byte⋆.bv and UpdQ.Byte⋆.data. The valid signals for the bytes of the update queue must be active for all bytes for which the byte-valid signals of the hit computation are active or which are written by the instruction. The byte data for the update queue are the global byte values. For a byte k with 0 ≤ k ≤ S_DC − 1 the following holds:

UpdQ.Byte_k.bv = (HC^0.eub_k ∧ HC^0.write) ∨ bv_k,
UpdQ.Byte_k.data = HC^0.byte_k  if HC^0.eub_k = 1,
                   byte_k       if HC^0.eub_k = 0.

The data bus ReadQ.Byte⋆.data to the read queue is only 32 bits wide. Thus, the requested word must be selected out of the cache-line returned by the global hit computation. Let W be the index of the requested word in the cache-line, i.e.:

W := ⟨HC^0.addr[s_DC − 1 : 2]⟩.

Then, the bytes W to W + 3 have to be selected in order to compute the data bus to the read queue. The byte-valid signals for the read queue ReadQ.Byte⋆.bv must be active for all bytes of the word which are valid or not used by the access (indicated by the un-extended usage bits HC^0.ub⋆). Due to the LWL and LWR instructions, the bytes which are not used by the instruction have to be set to the value of the input data bus of the instruction. Hence, the busses ReadQ.Byte⋆.{bv, data} to the read queue can be computed as (0 ≤ k ≤ 3):

ReadQ.Byte_k.bv = ¬HC^0.ub_k ∨ bv_{W+k},
ReadQ.Byte_k.data = byte_{W+k}                  if HC^0.ub_k = 1,
                    HC^0.data[8·k+7 : 8·k]      if HC^0.ub_k = 0.

The data on the bus ReadQ.Byte⋆.data contains all requested data in case of a cache- hit. Thus, the data bus to the memory unit Sh4L.data can be directly derived from the bus ReadQ.Byte⋆.data. The bus to the memory unit does not need byte-valid signals, since all bytes have to be valid for read hits. Figure 5.8 depicts the computation of the data busses and the byte-valid signals to the read queue and the data bus to the memory unit.
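The word selection and the LWL/LWR merge can be sketched in Python. The sketch relies on the reading used above, namely that ub_k = 1 marks a byte taken from memory and ub_k = 0 a byte taken from the instruction's input data bus; the function name is chosen for this illustration only:

```python
def readq_bytes(addr_word_index, line, line_bv, ub, in_data):
    """Select the accessed word W out of the cache-line and merge it with the
    input data bus (needed for LWL/LWR, which replace only part of a word).

    line     -- S_DC bytes returned by the global hit computation
    line_bv  -- their byte-valid bits
    ub       -- un-extended usage bits (assumption: 1 = byte comes from memory)
    in_data  -- 4 bytes of the instruction's input data bus
    """
    W = 4 * addr_word_index
    bv = [(not ub[k]) or line_bv[W + k] for k in range(4)]
    data = [line[W + k] if ub[k] else in_data[k] for k in range(4)]
    return bv, data

line = list(range(16))            # a 16-byte cache line, byte values 0..15
bv, data = readq_bytes(2, line, [True] * 16, [1, 1, 0, 0], [0xA, 0xB, 0xC, 0xD])
assert data == [8, 9, 0xC, 0xD]   # bytes 8, 9 from the line, rest from in_data
assert bv == [True, True, True, True]
```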

5.3.6 Stall Computation

The hit computation has to be stalled if the action for the access cannot be performed as described above. A result cannot be returned to the memory unit if Sh4L.stallOut is active. A new read or update queue entry cannot be made if the corresponding stall signal ReadQ.stallOut or UpdQ.stallOut is active. Hence, the input stall signal for the hit computation can be computed as:

HC.stallIn = (Sh4L.full ∧ Sh4L.stallOut) ∨ (ReadQ.full ∧ ReadQ.stallOut) ∨ (UpdQ.full ∧ UpdQ.stallOut).

5.3.7 Cost and Delay

The overall delay of the hit computation is dominated by the access to the cache RAM in the cache core. Thus, the critical signals are Core.hit and Core.byte⋆. As one can easily verify the output of the hit computation with the highest delay is the data bus to the read queue ReadQ.Byte⋆.data. Thus, the overall delay of the hit computation is:

D(HC) ≤ D(Core)                          (Core.byte⋆)
      + D_MUX                            (sbyte⋆, equation (5.2))
      + D_MUX                            (byte⋆, equation (5.4))
      + (s_DC − 2) · D_MUX + D_MUX.      (ReadQ.Byte⋆.data, figure 5.8)

To compute the number of cycles the result bus of the main memory has to be delayed, the maximum delay from the first stage of the result bus MM^0.⋆ to the outputs of the hit computation needs to be known. As above, the path to the bus ReadQ.Byte⋆.data has the highest delay out of all paths that start at MM^0.⋆.

Figure 5.9: Modified test circuit

The delay of this path is:

D(MM^0.⋆ → ReadQ.Byte⋆.data) ≤ D(EQ(33 − s_DC)) + D_MUX      (dbyte⋆, figure 5.3)
                             + D_MUX                          (byte⋆, equation (5.4))
                             + (s_DC − 2) · D_MUX + D_MUX.

This delay can be reduced if the selection step from the cache-line to the accessed word in figure 5.8 is removed from the critical path. The selection for the cache-lines of the result bus can be done in parallel with the computation of the forward signal using a modified circuit Test (see figure 5.9) in the forward tree. The circuit computes a second data output rbyte⋆ which contains the data needed by read accesses. This output is treated like the standard data in the forwarding circuit, i.e., the forwarding circuit computes 8 · S_DC + 32 data bits. The additional byte busses computed by the forwarding circuit are called drbyte⋆. Using the new bus drbyte⋆, the output to the read queue can be computed as follows (for a byte k with 0 ≤ k ≤ 3 and B := 8 · k):

ReadQ.Byte_k.data = HC^0.data[B+7 : B]  if HC^0.ub_k = 0,
                    byte_{W+k}          if HC^0.ub_k = 1

             (5.4)= HC^0.data[B+7 : B]  if HC^0.ub_k = 0,
                    sbyte_{W+k}         if HC^0.ub_k = 1 ∧ sbv_{W+k} = 1,
                    dbyte_{W+k}         else

                  = HC^0.data[B+7 : B]  if HC^0.ub_k = 0,
                    sbyte_{W+k}         if HC^0.ub_k = 1 ∧ sbv_{W+k} = 1,
                    drbyte_k            else

             (5.2)= HC^0.data[B+7 : B]  if HC^0.ub_k = 0,
                    qbyte_{W+k}         if HC^0.ub_k = 1 ∧ qbv_{W+k} = 1,
                    Core.byte_{W+k}     if HC^0.ub_k = 1 ∧ qbv_{W+k} = 0 ∧ Core.hit = 1,
                    drbyte_k            else.

Note that the most critical bus on which the bus ReadQ.Byte⋆.data depends is drbyte⋆, since it determines the delay of the path from the main memory to the read queue and therefore the number of cycles the result bus has to be delayed. The second

Figure 5.10: Optimized computation of the byte-valid and data signals for the read queue

most critical signals are the outputs of the cache core Core.⋆, since they determine the overall delay of the hit computation. Figure 5.10 depicts how the order of the multiplexers which compute the bus ReadQ.Byte⋆.data can be changed in order to use the critical signals as late in the multiplexer tree as possible. The left side of the figure depicts the computation of the signals ReadQ.Byte⋆.{data, bv} as described in the previous sections. This circuit can be transformed into the circuit on the right side of the figure by moving the multiplexers for the signals Core.byte⋆ and dbyte⋆ to the bottom (and replacing dbyte⋆ by drbyte⋆). Note that this also reduces the delay of the signals ReadQ.Byte⋆.bv, which also depend on the output of the cache core and the memory unit (via the signals Core.hit and dhit, respectively). The modifications depicted in figure 5.10 can also be used to reduce the delay of the busses to the update queue UpdQ.Byte⋆.{data, bv}. Finally, the delay of the reduction from the cache-line to the accessed word can be reduced if the address is decoded and a unary select circuit is used. Figure 5.11 shows a version of the global hit computation which takes all optimizations into account. Using the circuit in figure 5.11, the overall delay of the hit computation is:

D(HC) ≤ D(Core)+ D(Sel(SDC /4)) + 2 · DMUX . (5.6)

The delay of the path from the result bus to the outputs of the hit computation is:

D(MM^0.⋆ → ReadQ.Byte⋆.data) ≤ max{D(EQ(33 − s_DC)), (s_DC − 2) · D_MUX}    (hit, rbyte⋆, figure 5.9)
                             + D_MUX                                        (drbyte⋆, figure 5.3)
                             + D_MUX.

Let c_HC be the number of cycles needed for the hit computation, and let c_M2H be the number of cycles needed to compute the outputs of the hit computation based on the

Figure 5.11: Delay optimized hit computation

result bus. Then:

c_HC = ⌈D(HC)/δ⌉,
c_M2H = ⌈D(MM.⋆ → ReadQ.Byte⋆.data)/δ⌉.

The result bus must be delayed by cM2H − 1 cycles.
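The cycle computation is a simple ceiling division. The following Python sketch uses purely illustrative delay values, not numbers derived from the actual design parameters:

```python
import math

def pipeline_cycles(delay, stage_depth):
    """Number of pipeline stages needed for a path of the given gate delay,
    as in c_HC = ceil(D(HC) / delta)."""
    return math.ceil(delay / stage_depth)

delta = 9                                  # illustrative stage depth in gate delays
assert pipeline_cycles(25, delta) == 3     # e.g. c_HC for D(HC) = 25
assert pipeline_cycles(17, delta) == 2     # e.g. c_M2H for the result bus path
# the result bus must then be delayed by c_M2H - 1 = 1 cycle
assert pipeline_cycles(17, delta) - 1 == 1
```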

The delay of the hit computation depends on the width S_DC of the cache-lines. It can be reduced without changing S_DC by using a sectored cache. In a sectored cache, the width of a cache-line is a multiple of the width of the data RAM and of the data bus to the memory unit. Assume w ≥ 32 is the width of the data RAM and S_DC = k · w. The cache core then returns a w bit wide data bus and the delay of the hit computation is reduced to:

D(HC) ≤ D(Core)+ D(Sel(w/4)) + 2 · DMUX .

The disadvantage of a sectored cache is that it takes k cycles to return the result of a read request from the main memory or to write a cache-line into the cache core. The details of the necessary modifications are not discussed in this thesis.

Let K_DC = 2^{k_DC} be the associativity of the cache core; thus k_DC is the width of the bus way. Let e_UQ be the number of entries of the update queue. The cost of the static hit computation is (excluding registers):

C(staticHC) ≤ C(Dec(s_DC − 2)) + (S_DC + 1) · C_AND
            + (c_HC − 1 + c_M2H − 1) · C(EQ(33 − s_DC)) + (c_HC − 1) · C_AND
            + C(ForwardTree(32 − s_DC, 8 · S_DC, c_HC − 1 + e_UQ + c_M2H, c_HC))
            + C(FFO(c_HC − 1 + e_UQ)) + k_DC · C(Sel(c_HC + e_UQ)).

The cost for the global hit computation is (excluding registers):

C(globalHC) ≤ 3 · CAND + COR + SDC · (CAND + CXNOR)

+ C(Dec(sDC − 2)) + 68 · C(Sel(SDC /4))

+ C(AND-Tree(SDC )) + (2 · SDC + 10) · COR

+ 3 · (SDC + 32) · CMUX .

The hit computation has 32+32+4+1+1 inputs from the memory unit, 8 · SDC + kDC + 1 inputs from the cache core, eUQ · (9 · SDC + 1) inputs from the update queue, and 8 · SDC + 32 inputs from the result bus. The hit computation computes 32 + 3 signals sent to the memory unit, 9 · SDC + 5 signals sent to the update queue, and 32 + 4 + 3 signals sent to the read queue. The total cost for the hit computation (including all registers) is thus estimated as:

C(HC) ≤ C(staticHC)+ C(globalHC)

+ C(ForwardStall(32 − sDC , 8 · SDC + 32,cHC ,cM2H ))

+ (cHC − 1) ·⌈(25 · SDC + kDC + 182 + eUQ · (9 · SDC + 1))/2⌉· CREG.

5.4 Cache Core

For the cache core, the same hardware can be used as for a simple blocking cache (excluding the control and the interface to the main memory). The cache core is basically a K_DC-way set-associative cache. The design of the cache core is not discussed in this thesis; it can be found, e.g., in [MP00]. The cache core must follow the behavior described below. The cache core has a read and a write port, which are used by the hit computation for reading and by the update queue for writing. On a read access the cache core returns, in addition to the data, a hit signal and a way. The hit signal hit is active if the requested line is in the cache. In this case the data output data contains the requested line and the way output way points to the way of the cache where the line is saved. If hit is inactive, way points to the way into which the requested line should be written. This is determined by the internal replacement algorithm of the cache core. For write accesses the address bus addr must contain the address of the line which is written and the input bus way must point to the way into which the line is written. The bus data must contain the current value of the line. For correctness the replacement algorithm is irrelevant; e.g., a least recently used (LRU) algorithm can be used. It must only be ensured that a line is stored in at most one way. Note that multiple accesses to the cache can be made between the computation of the way for an access and the update of the cache by this access. Thus, for an LRU algorithm, at the time an access writes to the cache core the cache-line that is overwritten may no longer be the least recently used. For this to happen, several accesses to addresses that are stored in the same line of the cache core must be processed at the same time. Since this is considered rare and the replacement algorithm does not affect correctness, it is acceptable.
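The interface behavior described above can be modeled by a short Python sketch. The class and the round-robin replacement are illustrative assumptions only; the design may equally use LRU, since correctness does not depend on the replacement algorithm:

```python
class CacheCore:
    """Minimal model of the cache core interface: a K-way set-associative
    store that returns (hit, way, data) on reads and is written explicitly
    with (line, tag, way, data). Replacement here is simple round-robin."""

    def __init__(self, lines, ways):
        self.ways = ways
        self.tags = [[None] * ways for _ in range(lines)]
        self.data = [[None] * ways for _ in range(lines)]
        self.victim = [0] * lines          # next way to replace, per line

    def read(self, line, tag):
        for w in range(self.ways):
            if self.tags[line][w] == tag:
                return True, w, self.data[line][w]   # hit: data and way
        return False, self.victim[line], None        # miss: way to fill

    def write(self, line, tag, way, value):
        # a tag is stored in at most one way because the way to write
        # is always taken from a preceding read of the same line
        self.tags[line][way] = tag
        self.data[line][way] = value
        self.victim[line] = (way + 1) % self.ways

core = CacheCore(lines=4, ways=2)
hit, way, _ = core.read(0, tag=7)
assert not hit                              # cold miss; way tells where to write
core.write(0, tag=7, way=way, value=b"line7")
assert core.read(0, tag=7) == (True, way, b"line7")
```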
Let S_DC = 2^{s_DC} be the number of bytes of a line, L_DC = 2^{l_DC} the number of lines, and K_DC = 2^{k_DC} the number of ways of the data cache. Let c_HC be the number of cycles of the hit computation. If an LRU algorithm is used, the cost of the replacement circuit Core-Replace with 34 − s_DC + k_DC inputs and k_DC outputs is (formulas taken from [MP00]):

C(Core-Replace) ≤ C(RAM(LDC , KDC · kDC , 1, 1,cHC ))

+ KDC · (C(EQ(kDC )) + CAND)+ C(PP-OR(KDC ))

+ KDC · kDC · 5 · CMUX + C(Enc(KDC ))

+ ⌈(34 − l_DC + 2 · k_DC)/2⌉ · c_HC · C_REG.

The number of inputs to the cache core is 33 + k_DC + 8 · S_DC; the number of outputs is 1 + k_DC + 8 · S_DC. The cost and the delay of the cache core are:

D(Core) ≤ max{D(RAM(L_DC, 8 · S_DC, 1, 1)),
              D(RAM(L_DC, 33 − l_DC − s_DC, 1, 1)) + D(EQ(33 − l_DC − s_DC))}
        + D(Sel(K_DC)),

C(Core) ≤ KDC · (C(RAM(LDC , 8 · SDC , 1, 1,cHC ))

+ C(RAM(LDC , 33 − lDC − sDC , 1, 1,cHC ))

+ C(EQ(33 − lDC − sDC )))

+ 8 · SDC · C(Sel(KDC )) + C(OR-Tree(KDC ))

+ C(Dec(k_DC)) + K_DC · C_AND
+ 0                  if K_DC = 1,
  C(Core-Replace)    if K_DC > 1.

5.5 Update Queue

The update queue contains all cache accesses that will eventually update the cache RAM, i.e., load misses and stores. A store access may only update the cache RAM or the main memory if all preceding instructions have retired and no page fault interrupt occurred for the instruction. This is necessary for precise interrupts and branch speculation.

Figure 5.12 depicts the update queue with e_UQ entries. The update queue is built similarly to the reservation stations (see section 4.2). The update queue is controlled by the circuit UpdQ-Control. The circuits UpdQ-Entry⋆ form a queue and contain the update queue entries. New instructions are always filled into entry 0 in the circuit UpdQ-Entry0. An instruction proceeds to the next entry whenever this entry is empty. If the update queue is full, the control raises the signal stallOut. This prevents the hit computation from filling new entries into the update queue. New entries are filled into the update queue by the hit computation on the bus HC.UpdQ.⋆. The control circuit starts all read and write accesses to the main memory using the bus UpdQ.MM.⋆. The main memory can accept new requests if the control signal MM.ack is active. The results of read accesses are sent to all queue entries using the result bus of the main memory MM.⋆, similar to the CDB in the reservation stations. If the result bus is delayed, the last stage of the delayed result bus is used by the update queue (see figure 5.3 in section 5.3).

Figure 5.12: Update queue

Figure 5.13: Update queue entry

The update queue updates the cache core using the bus UpdQ.Core.⋆ and returns the stores to the memory unit on the bus UpdQ.Sh4L.⋆. Stores need to be returned to the memory unit in order to be removed from the reorder buffer. Note that stores must return the effective address in case of an interrupt and hence produce a result in this case. This result then has to be written into the reorder buffer via the CDB. If the memory unit cannot accept an instruction, the signal stallIn is raised.

5.5.1 Entries

Figure 5.13 depicts the circuit for the update queue entries. The basic structure equals that of the reservation station entries. If the signal fill is inactive, the updated values of the instruction currently stored in the entry are written into the registers. If the signal fill is active, the registers are overwritten with the updated values of the instruction in the preceding entry (respectively the hit computation for the first entry). For every byte k ∈ {0, ..., S_DC − 1} of a cache-line, the update queue entries have a sub-circuit Byte_k. These sub-circuits store for each byte a valid bit and the data, similar to the operands of the reservation station entries. Additionally, each entry contains the following registers:

Figure 5.14: Update queue entry byte k

• full: This bit indicates that the entry contains a valid instruction. It is reset if the processor is flushed (indicated by the global control signal clear) or the entry is removed from the queue (indicated by the signal rfq).

• write: This bit is active for store instructions.

• req: This bit indicates that a read request to the main memory has to be made. It is initialized with one for misses and zero for hits or same-line misses (see equation (5.5) on page 121). The register is reset when the request is sent to the main memory (indicated by the signal sent).

• rdy: This bit is active if all bytes are valid and initialized with one for hits (see equation (5.5) on page 121). The bit is set if the memory returns the requested line (indicated by the signal upd).

• con: This field contains all information about the instruction which is not altered by the update queue: the address of the requested line addr, the tag of the instruction tag, and the opcode of the instruction.

• Dpf: This signal is active if the access caused a page fault interrupt. It is com- puted by the main memory and updated when the memory returns the requested line.

The circuit Byte_k for each byte k is shown in figure 5.14. For every byte the circuit saves a valid bit bv. This valid bit is initialized with the value of the signal HC.UpdQ.bv_k (see page 121). The control signals fill, sent, and rfq are computed by the update queue control circuit. The entry computes the control signals lsl, upd, and allRet. The signal lsl is active if the instruction in the entry accesses the same line as the instruction which enters the hit computation; it is used by the static hit computation. For the computation of the signal lsl, the addresses of the two instructions are compared and the full bit of the entry is checked. The signal upd has to be active if the result bus returns the data needed to update the entry. This is the case if the cache-line addresses of the result bus and the entry match and the full bit of the result bus is active. The signal allRet indicates that all instructions preceding the instruction in the entry have retired. This signal can be computed analogously to the signal allRet computed by the ROB control (see section 4.6.5).

Figure 5.15: Update queue control (part 1)

Let S_DC = 2^{s_DC} be the number of bytes of a cache line and L_ROB = 2^{l_ROB} be the number of ROB entries. The cost of an update queue entry is (without pipelining):

C(UpdQ-Entry) ≤ SDC · (COR + 10 · CMUX + 9 · CREG)

+ 2 · C(EQ(33)) + C(EQ(lROB)) + CAND

+ COR + 3 · CAND + 5 · CMUX + (38 + lROB) · CREG.

5.5.2 Control

The control circuit UpdQ-Control controls the entries of the update queue and computes the global outputs. The entries are controlled by the signals fill⋆, sent⋆, and rfq⋆, which are computed for every entry. The global outputs are the stall signal stallOut to the hit computation, the bus UpdQ.MM.⋆ which starts new read or write requests to the main memory, the bus UpdQ.HC.⋆ containing the local hit signals for the hit computation, the bus UpdQ.Core.⋆ which is used to update the cache core, and the bus UpdQ.Sh4L.⋆ which returns the results of store instructions to the memory unit. Figure 5.15 shows the first part of the update queue control. The bus UpdQ.HC.⋆ sent to the hit computation can be computed directly by renaming the outputs of the entries according to the naming convention used in section 5.3. The signals fill⋆ controlling the movement of the instructions in the queue can be computed as in the reservation stations: an instruction proceeds to the next entry whenever the next entry is empty. This is computed by the parallel-prefix OR in figure 5.15. The signal stallOut is active if the first entry cannot be filled:

stallOut = ¬fill_0.
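The fill chain and the stall signal can be illustrated in Python. The sketch is written under the assumption that an entry accepts new content if it is empty or if its own content can move on (with the last entry freed by rfq), which is the chain that a parallel-prefix OR computes; the exact hardware may differ in detail:

```python
def fill_signals(full, rfq_last):
    """fill_i: entry i can be (re)loaded from its predecessor this cycle.
    Entries move from index 0 towards index e-1; an entry accepts new
    content if it is empty or if its own content can move on, and the
    last entry frees itself only when it is removed (rfq)."""
    e = len(full)
    fill = [False] * e
    fill[e - 1] = (not full[e - 1]) or rfq_last
    for i in range(e - 2, -1, -1):           # prefix-OR chain from the end
        fill[i] = (not full[i]) or fill[i + 1]
    stall_out = not fill[0]                  # stallOut = NOT fill_0
    return fill, stall_out

# a full queue stalls the hit computation unless the last entry retires
fill, stall = fill_signals([True, True, True], rfq_last=False)
assert stall and fill == [False, False, False]
fill, stall = fill_signals([True, True, True], rfq_last=True)
assert not stall and fill == [True, True, True]
# a bubble in the middle lets the younger entries move up
fill, stall = fill_signals([True, False, True], rfq_last=False)
assert not stall and fill == [True, True, False]
```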

The remaining signals computed by the update queue control depend on which actions need to be performed for the instructions in the entries. If an instruction is not ready it must start a read request to the main memory to load the needed line. If a store instruction is ready it must update the main memory and the cache core. As soon as a store instruction has updated main memory and cache core, it can be removed from the update queue and be returned to the memory unit. Load instructions update the cache core as soon as they are ready. They can be removed from the update queue afterwards without being returned to the memory unit, since load misses are returned to the memory unit by the read queue. In order to simplify the coherency, all actions except the read request to the main memory are done in one cycle. Thus, store instructions must update the main memory and the cache core, be returned to the memory unit and be removed from the queue at 5.5 Update Queue 131 the same time. Load misses are removed from the queue when they update the cache core. If an instruction accesses a non-valid page, the main memory activates the page fault interrupt MM.Dpf and returns invalid data. Yet the ready bit for the instructions is set in order to return the instructions to the memory unit. To stay consistent, all updates to the main memory or the cache core must be prevented if an instruction has caused a page fault. Note that a page fault can only occur on a read request. When the main memory is updated by a store instruction, the cache-line is already in the cache and therefore the access cannot cause a page fault. The main memory is assumed to be single ported. Thus, at most one read or write request can be started in one cycle. The signal MM.ack indicates that the main memory can accept a new request. If the instruction in entry i sends a read request to the main memory the signal senti is activated. 
This signal resets the request bit of the instruction to prevent the same request from being sent twice. If a store in entry i updates the main memory, the signal rfq_i is activated, which removes the store from the queue. If two entries need to start accesses of the same type, the entry with the higher index (i.e., the one holding the older instruction) is preferred. Read accesses have a higher priority than write accesses. This is done to minimize the delay of loads, which are assumed to be performance critical.

To compute the control signals sent⋆ and rfq⋆ for all entries, the requests made by the update queue entries must be computed. Three different types of requests are possible. If an entry contains an access that is not ready (i.e., not all data of the cache-line is valid), it must start a read request. If an entry contains a ready store that may update the main memory (since all preceding instructions have retired), a write request must be started. A write request also indicates that the store needs to update the cache core, be removed from the update queue, and be returned to the memory unit. If an entry contains a ready load, it must start an update request in order to update the cache core and be removed from the update queue. Note that at any time only one of the three requests may be active. The entry i needs to start a read request if it contains a valid access and the request bit req of the entry is set:

rreq_i = Entry_i.full ∧ Entry_i.req. (5.7)

Note that the request bit is reset once the read request is granted in order to prevent the access from starting a second read request. The entry i needs to start a write request to the main memory (indicated by wreq_i) if it is valid, a write instruction, all preceding instructions have retired, and all bytes are valid. Thus:

wreq_i = Entry_i.full ∧ Entry_i.write ∧ Entry_i.allRet ∧ Entry_i.rdy. (5.8)

Since the signal allRet can be active for at most one instruction, wreq_i can be active for at most one entry i. If the full bit of entry i is active and the write bit is inactive, the entry must start an update request, indicated by creq_i:

creq_i = Entry_i.full ∧ ¬Entry_i.write ∧ Entry_i.rdy.
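The three request types of an entry, i.e., equations (5.7), (5.8), and the definition of creq_i, can be summarized in a Python sketch (names chosen for this illustration only):

```python
def entry_requests(full, req, write, all_ret, rdy):
    """Request signals of one update queue entry."""
    rreq = full and req                           # line must still be fetched
    wreq = full and write and all_ret and rdy     # ready store that may retire
    creq = full and (not write) and rdy           # ready load updates the core
    return rreq, wreq, creq

# req and rdy of one entry are never active at the same time; under this
# invariant at most one of the three request signals is raised
for req in (False, True):
    for write in (False, True):
        for all_ret in (False, True):
            rdy = not req                         # exercise the extreme case
            reqs = entry_requests(True, req, write, all_ret, rdy)
            assert sum(reqs) <= 1
```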

Figure 5.16: Update queue control (part 2)

From the construction of the update queue and the hit computation it follows that the bits req and rdy of a single entry cannot be active at the same time. Thus, at most one of the request signals may be active for an entry at a certain time. A read request must be ignored if the main memory cannot accept a new request, i.e., if the signal MM.Ack is inactive. A write request must be ignored if MM.Ack is inactive or the store cannot be returned to the memory unit (indicated by stallIn). The signals rreq⋆, wreq⋆, and creq⋆ are combined to overall request signals rreq, wreq, and creq which are active if any of the entries wants to start a request of this type and the request does not have to be ignored:

rreq = ( ⋁_{i=0}^{e_UQ−1} rreq_i ) ∧ MM.Ack, (5.9)

wreq = ( ⋁_{i=0}^{e_UQ−1} wreq_i ) ∧ MM.Ack ∧ ¬stallIn, (5.10)

creq = ⋁_{i=0}^{e_UQ−1} creq_i.

Read requests to the main memory have the highest priority. They are sent to the main memory even if an older write request wants to update the main memory. Note that this is consistent since the data of the write has already been forwarded in the hit computation. Updates to the cache core are forwarded to the hit computation and are therefore not performance critical. Updates of the cache core by load instructions have the lowest priority, as stalling the entry in the update queue does not prevent the read queue from returning the result to the memory unit. Figure 5.16 shows the second part of the update queue control. The main function of this part of the control is to compute the busses sent⋆ and rfq⋆ (in the center of the figure). For each request type, the entry holding the oldest instruction with a request of this type is computed. For the signals rreq⋆ and creq⋆ this is done using a find-first-one circuit. Since at most one of the write requests wreq⋆ may be active, the find-first-one circuit can be omitted for these signals.

The write requests are ignored if MM.Ack is inactive or stallIn is active. For the read requests rreq⋆, the signal MM.Ack is used as the most significant input of the find-first-one circuit computing the oldest instruction with a read request. Thus, if MM.Ack is inactive, only the most significant bit of the output of the find-first-one circuit is active. This bit is ignored and therefore no read request is sent if the signal MM.Ack is inactive. The read requests to the main memory have the highest priority. Therefore, the signals sent⋆ can be computed directly from the outputs of the find-first-one circuit for the read requests. A store instruction can be removed from the update queue if a write request to the main memory can be made (wreq is active) and no read request is made (rreq is inactive). In this case the write requests are used to compute the signals rfq⋆; otherwise the outputs of the find-first-one circuit for the request signals creq⋆ are used. The control circuit in figure 5.16 also computes the busses to the main memory UpdQ.MM.⋆, the cache core UpdQ.Core.⋆, and the memory unit UpdQ.Sh4L.⋆. An access to the main memory is made (indicated by UpdQ.MM.full) if rreq or wreq is active and the selected instruction did not cause a page fault interrupt (indicated by the signal Dpf of the entry). Note that the signal Dpf cannot be active for read accesses since it can only be activated when the memory returns the result of the read request. If rreq is active, the read requests are used to select the instruction which accesses the main memory; otherwise the write requests are used. An instruction is returned to the memory unit (indicated by UpdQ.Sh4L.full) if a write request to the main memory is made. The cache core is updated (indicated by UpdQ.Core.full) if an instruction is returned to the memory unit or the queue contains a ready load instruction (creq is active). The cache core must not be updated if the selected instruction caused a page fault.
Cache core and memory unit are updated by the same instruction and therefore use the same data. The instruction is selected with the signals rfq⋆.
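The arbitration described above can be sketched in software. The following Python model is illustrative only: the list-based entry representation (higher index meaning older instruction) and the helper names are assumptions, not the thesis's notation. It computes the global request signals of equations (5.9) and (5.10) and the one-hot selection busses sent⋆ and rfq⋆:

```python
def find_oldest(bits):
    """Model of a find-first-one circuit over queue entries.
    Higher index = older instruction; returns a one-hot list."""
    out = [False] * len(bits)
    for i in reversed(range(len(bits))):
        if bits[i]:
            out[i] = True
            break
    return out

def update_queue_arbiter(rreq, wreq, creq, mm_ack, stall_in):
    """Sketch of the update-queue request arbitration (priorities:
    main-memory reads > main-memory writes > cache-core updates)."""
    e = len(rreq)
    g_rreq = any(rreq) and mm_ack                    # eq. (5.9)
    g_wreq = any(wreq) and mm_ack and not stall_in   # eq. (5.10)
    g_creq = any(creq)
    # Reads win: sent* comes directly from the read-request FFO.
    sent = find_oldest(rreq) if g_rreq else [False] * e
    if g_wreq and not g_rreq:
        rfq = find_oldest(wreq)   # at most one wreq bit is set
    elif g_creq:
        rfq = find_oldest(creq)   # ready load updates the cache core
    else:
        rfq = [False] * e
    return sent, rfq
```

For example, with a pending read in entry 1, a write in entry 2, and a ready load in entry 0, the read is sent to the main memory while the cache-core update of the load is selected by rfq⋆ (page-fault masking is omitted in this sketch).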

5.5.3 Delay Optimizations

The computation of the signals rreq and wreq of an entry can be pipelined similarly to the signal req in the reservation station entries (see section 4.2.3). The signals rreq and wreq must be reset if the control signals sent or rfq, respectively, are active or if the update queue is cleared by the signal clear. Figure 5.17 shows the pipelined computation of the request signals rreq and wreq for an update queue entry. Note that the update queue is then pipelined in two dimensions.

In order to fit into one cycle, the delay of the signal rfq⋆ in figure 5.17 must be at most δ − DAND − DMUX. Due to the pipelining, the delay of the signals wreq⋆ and rreq⋆ can be assumed to be 0. The first stage of the circuit Sh4L is assumed to have a buffer circuit, thus the delay of the signal stallIn is equal to DAND. Using the proposed circuit for the update queue control bounds the stage depth δ from below (with the critical path going through the signal wreq (see equation (5.10)) on the right side of figure 5.16):

[Figure 5.17: Pipelining of the request signals computation]

δ − DAND − DMUX ≥ D(rfq⋆)
               ≥ D(wreq) + DAND + DMUX
               ≥ D(OR-Tree(eUQ)) + 2 · DAND + DMUX
⇔ δ ≥ D(OR-Tree(eUQ)) + 3 · DAND + 2 · DMUX.

For δ = 5 this bound does not hold. To reduce the delay of the critical path of the control circuit, the following restriction to the update queue is introduced: only the last entry of the update queue may update the cache RAM or the main memory. Thus, no signal rfq_i has to be computed for i < eUQ − 1. For load instructions this has no performance impact, because the read queue can already return the instruction to the memory unit while the instruction still waits in the update queue for updating the cache core. However, it might happen that the update queue is full more often. Write instructions are only delayed if the write is the last active instruction and the data are valid before the instruction enters the last entry. For a small stage depth and small numbers of entries, this is not assumed to have a significant performance impact.

Figure 5.18 shows the delay optimized version of the second part of the update queue control using the above restriction. The restriction allows the following modifications of the control circuit. If the last queue entry contains a ready load miss (indicated by an inactive Entry_{eUQ−1}.write and an active Entry_{eUQ−1}.rdy), it can update the cache core in any case. Thus the instruction can be removed from the queue. This can be done by overwriting the entry, i.e., by activating the fill signal fill_{eUQ−1} for this entry. The computation of the fill signals is adjusted accordingly. Thus, for the computation of the signal rfq_{eUQ−1} load misses do not have to be taken into account. Since only the last entry may update the main memory or the cache core, the page fault signal of the entry that does the update need no longer be computed with a select circuit. This reduces the delay of the full signals for the main memory (UpdQ.MM.full) and the cache core (UpdQ.Core.full).

[Figure 5.18: Delay optimized update queue control (part 2)]

[Figure 5.19: Optimized clearing of full bit and write request for the last entry]

Assume the delay of the request signals and the acknowledge signal MM.Ack is 0 and the delay of the stall signal stallIn is DAND. Then, the modifications reduce the delay of the signal rfq_{eUQ−1} to:

D(rfq_{eUQ−1}) ≤ max{D(rreq), 2 · DAND} + DAND   (5.9)
              ≤ max{D(OR-Tree(eUQ)) + DAND, 2 · DAND} + DAND
              ≤ max{D(OR-Tree(eUQ)), DAND} + 2 · DAND.

If rfq_{eUQ−1} is active, it follows that the last entry is not empty and does not contain a load miss. Hence, the signal fill_{eUQ−1} cannot be active. This makes it possible to move the AND-gates that clear the full and the write request signal of the last queue entry below the multiplexer controlled by fill_{eUQ−1}. Figure 5.19 depicts the modification applied to figure 5.17. With this modification, the bound for δ regarding the signals rfq⋆ is reduced to:

δ − DAND ≥ D(rfq⋆)   (5.11)
⇔ δ ≥ max{D(OR-Tree(eUQ)), DAND} + 3 · DAND.   (5.12)
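The effect of the last-entry restriction on the δ-bound can be evaluated numerically. The sketch below uses hypothetical unit gate delays (DAND = DOR = 1, DMUX = 2) and a balanced-tree model for D(OR-Tree(n)); the thesis keeps these quantities symbolic, so the concrete numbers are for illustration only:

```python
import math

# Hypothetical unit delays; the thesis derives these bounds symbolically.
D_AND, D_OR, D_MUX = 1, 1, 2

def d_or_tree(n):
    """Delay of a balanced OR-tree over n inputs."""
    return math.ceil(math.log2(n)) * D_OR if n > 1 else 0

def min_delta_unoptimized(e_uq):
    """Original bound: delta >= D(OR-Tree(eUQ)) + 3*D_AND + 2*D_MUX."""
    return d_or_tree(e_uq) + 3 * D_AND + 2 * D_MUX

def min_delta_optimized(e_uq):
    """Bound (5.12): delta >= max{D(OR-Tree(eUQ)), D_AND} + 3*D_AND."""
    return max(d_or_tree(e_uq), D_AND) + 3 * D_AND
```

For eUQ = 4 entries and these unit delays, the unoptimized circuit requires δ ≥ 9, while the last-entry restriction reduces the requirement to δ ≥ 5.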

[Figure 5.20: Optimized completion for stores]

Note that, in order to reduce the delay, the signals fill⋆ do not depend on the signal rfq_{eUQ−1}. Thus, if a store updates the cache core and the main memory, the last update queue entry is not filled in the next cycle. Hence, store instructions can only complete every other cycle. This is not assumed to have a performance impact as store instructions must wait until they are the oldest active instruction anyway.

5.5.4 Optimized Completion for Store Instructions

If a store instruction does not cause an interrupt, it does not produce a result that would be forwarded to succeeding instructions. Thus, it is not necessary to send a store instruction via the CDB. At the time the store instruction updates the cache core and the main memory, it must be at the head of the ROB. Thus, to complete a store instruction that did not cause an interrupt, it suffices to activate the signal ROBhead.valid indicating that the instruction at the ROB head has completed.

The modified completion of store instructions has the following advantages. The number of cycles needed to complete stores can be reduced. If the instructions following the store complete earlier, this can reduce the number of instructions in the ROB. Also, store instructions do not block the CDB for instructions whose result is needed by later instructions. The modifications to the ROB environment are shown in figure 5.20.

The simplest way to modify the update queue control accordingly is to set the full signal to the memory unit UpdQ.Sh4L.full to 0 for store instructions if their page fault signal Entry_{eUQ−1}.Dpf is inactive. This modification does not affect the delay of the update queue. However, stores are then stalled by the input signal stallIn even if they do not use the circuit Sh4L. In order to complete store instructions even if the stall input is active, the computation of the signals rfq_{eUQ−1}, UpdQ.Sh4L.full, and UpdQ.Core.full must be adapted as shown in figure 5.21. Note that this modification increases the delay from the stall input to the signal rfq_{eUQ−1}. Yet, if stallIn comes directly out of a buffer circuit, the delay of rfq_{eUQ−1} based on stallIn is DOR + 3 · DAND, and thus D(rfq_{eUQ−1}) ≤ δ − DAND holds for δ ≥ 5. Therefore, this second option can be used.

[Figure 5.21: Update queue control with optimized completion]

5.5.5 Cost and Delay

For the bound on δ given by the update queue, the stall input stallIn is assumed to come directly out of a buffer circuit, i.e., the delay of stallIn is assumed to be DAND. It is also assumed that a register is inserted after the computation of the signals wreq_{eUQ−1} and rreq⋆. The delay of the signal rfq_{eUQ−1} may be at most δ − DAND (see equation (5.11)). The modified computation of the signal rfq_{eUQ−1} from figure 5.21 increases the bound for δ to

δ ≥ max{D(OR-Tree(eUQ)), 2 · DAND} + 3 · DAND.   (5.13)

The delay of the signals sent⋆ (computed in figure 5.18) may not be larger than δ − DMUX − DAND (see figure 5.13). Thus, it must also hold:

δ ≥ D(FFO(eUQ + 1)) + DAND + DMUX.   (5.14)

Note that the equations (5.13) and (5.14) imply a bound for the number of update queue entries eUQ for a given δ. Let the boolean variables prrq and pwrq be zero if the requirements for the signals rfq_{eUQ−1} and sent⋆ also hold if no register is added after the computation of the signals rreq⋆ and wreq_{eUQ−1}. If no registers are added, the delay from the full bits of the entries through the signals rreq⋆ and wreq_{eUQ−1} to the signal rfq_{eUQ−1} (which updates the full bit of entry eUQ − 1) is (see equations (5.7), (5.8), and (5.9) and figure 5.21):

D(Entry⋆.full → rreq⋆ → rfq_{eUQ−1}) ≤ DAND              (rreq⋆)
                                     + D(OR-Tree(eUQ))
                                     + DAND              (rreq)
                                     + DAND,             (rfq⋆)

D(Entry⋆.full → wreq_{eUQ−1} → rfq_{eUQ−1}) ≤ 2 · DAND   (wreq_{eUQ−1})
                                            + 3 · DAND.

The delay of the path from the request bits of the entries through the signals rreq⋆ to the signals sent⋆ is (see equation (5.7) and figure 5.18):

D(Entry⋆.req → rreq⋆ → sent⋆) ≤ DAND               (rreq⋆)
                              + D(FFO(eUQ + 1)).   (sent⋆)

Thus, the variables prrq and pwrq can be computed as:

prrq = 0 if δ ≥ max{D(OR-Tree(eUQ)) + 4 · DAND, D(FFO(eUQ + 1)) + 2 · DAND + DMUX}, else 1,

pwrq = 0 if δ ≥ 6 · DAND, else 1.
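Under the same hypothetical unit-delay model as before (DAND = DOR = 1, DMUX = 2, with balanced-tree models for the OR-tree and find-first-one delays, which are assumptions rather than the thesis's technology parameters), prrq and pwrq can be computed as:

```python
import math

# Illustrative gate-delay model (hypothetical unit delays).
D_AND, D_OR, D_MUX = 1, 1, 2

def d_or_tree(n):
    return math.ceil(math.log2(n)) * D_OR if n > 1 else 0

def d_ffo(n):
    # Rough model of a parallel-prefix find-first-one circuit.
    return math.ceil(math.log2(n)) * (D_AND + D_OR) if n > 1 else 0

def request_registers(delta, e_uq):
    """prrq/pwrq: 1 if a register must be added after rreq*/wreq_{eUQ-1}."""
    prrq = 0 if delta >= max(d_or_tree(e_uq) + 4 * D_AND,
                             d_ffo(e_uq + 1) + 2 * D_AND + D_MUX) else 1
    pwrq = 0 if delta >= 6 * D_AND else 1
    return prrq, pwrq
```

With these unit delays and eUQ = 4, a stage depth of δ = 5 forces both registers to be inserted, while δ = 10 needs neither.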

The read request signal rreq⋆ (see equation (5.7)) of an entry that is filled into the update queue can already be computed during the hit computation (for an additional cost of CAND). Thus, it can be assumed that an access that is filled into the update queue can start a read request to the main memory within the following cycle.

Let cM2W be the minimum number of cycles needed between the update of the update queue by the result bus of the main memory and the retiring of a store instruction. The number of cycles is defined by the delay of the path from the result bus to the full bit of the last entry of the update queue. The stall input stallIn is not assumed to be timing critical, otherwise a buffer circuit can be inserted in the first stage of the memory unit. The delay of the path via the signal wreq_{eUQ−1} is (see equation (5.8) and figures 5.13 and 5.18):

D(MM.⋆ → wreq_{eUQ−1} → Entry_{eUQ−1}.full′) ≤ D(EQ(33))           (Entry⋆.upd)
                                             + DOR                 (Entry⋆.rdy)
                                             + (2 + pwrq) · DAND   (wreq_{eUQ−1})
                                             + 3 · DAND            (rfq_{eUQ−1})
                                             + DAND.

Thus:

cM2W ≤ ⌈(D(MM.⋆ → wreq_{eUQ−1} → Entry_{eUQ−1}.full′) − DMUX) / (δ − DMUX)⌉.

For read instructions the delay of the path from the main memory to the full bit of the last entry via the signal fill_{eUQ−1} is (see figures 5.13 and 5.21):

D(MM.⋆ → creq_{eUQ−1} → Entry_{eUQ−1}.full′) ≤ D(EQ(33))     (Entry⋆.upd)
                                             + DOR           (Entry⋆.rdy)
                                             + DAND + DOR    (fill_{eUQ−1})
                                             + DMUX + DAND.

Let cM2R be the minimum number of cycles between the requested data being in the last stage of the result bus and the update of the cache core by a load instruction. Then it holds:

cM2R ≤ ⌈(D(MM.⋆ → creq_{eUQ−1} → Entry_{eUQ−1}.full′) − DMUX) / (δ − DMUX)⌉.
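The two ceiling expressions can be evaluated with the same hypothetical delay model (unit delays; D(EQ(33)) modeled as an XNOR column followed by an AND-tree, which is an assumption for illustration, not the thesis's equality tester):

```python
import math

# Hypothetical unit delays and a rough model of the 33-bit equality test.
D_AND, D_OR, D_MUX = 1, 1, 2

def d_eq(n):
    """XNOR column followed by an AND-tree over n bits."""
    return D_AND + math.ceil(math.log2(n)) * D_AND

def c_m2w(delta, pwrq):
    """Cycles from the main-memory result bus to retiring a store."""
    path = d_eq(33) + D_OR + (2 + pwrq) * D_AND + 3 * D_AND + D_AND
    return math.ceil((path - D_MUX) / (delta - D_MUX))

def c_m2r(delta):
    """Cycles from the result bus to the cache-core update of a load."""
    path = d_eq(33) + D_OR + D_AND + D_OR + D_MUX + D_AND
    return math.ceil((path - D_MUX) / (delta - D_MUX))
```

For δ = 5 and pwrq = 0 this toy model yields cM2W = cM2R = 4; the symbolic bounds in the text are what matters, the numbers only illustrate how quickly the cycle counts grow for small δ.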

The part of the delay of the outputs UpdQ.MM.⋆ that does not fit into the cycle in which the entries are updated is added to the delay of the main memory. It depends on the signals sel⋆ which select the entry sent to the main memory. The delay of the signals sel⋆ depends on whether the read or the write request is critical and on whether registers are added after the request signals:

D(sel⋆) ≤ max{(1 − pwrq) · 2 · DAND + 2 · DAND,
              D(EQ(33)) + DOR + 4 · DAND − (cM2W − 1) · δ,
              (1 − prrq) · DAND + D(OR-Tree(eUQ)) + DAND,
              D(EQ(33)) + DOR + 2 · DAND + D(OR-Tree(eUQ)) − (cM2R − 1) · δ} + DMUX.

Then the following delay has to be added to the delay of the main memory:

D⁺(MM) ≤ D(Sel(eUQ)) + D(sel⋆) − δ.

The delay of the signal stallOut sent to the hit computation is:

D(stallOut) ≤ D(PP-OR(eUQ + 2)). (5.15)

If the optimized completion for stores is used and the control is modified as shown in figure 5.21 the delay of the stall input stallIn is bounded by

δ ≥ D(stallIn) + 4 · DAND. (5.16)

Note that if the stall input is computed with an AND-tree, the first AND-gate can be merged into this tree. If this bound does not hold, the update queue control cannot be modified as in figure 5.21, and thus stores are stalled by the signal stallIn even if they are not returned to the memory unit (see section 5.5.4).

Let SDC = 2^{sDC} be the number of bytes of a cache-line. The number of inputs of the update queue from the main memory is 33 + 8 · SDC. The number of outputs of the update queue is 5 + 8 · SDC. The additional cost for the update queue entries due to pipelining is approximated by:

C⁺(UpdQ-Entry) ≤ (cM2W − 1) · (19 + 8 · SDC) · (CMUX + CREG) + 2 · CAND.

The cost of the update queue control is:

C(UpdQ-Control) ≤ C(FFO(eUQ + 1)) + C(PP-OR(eUQ + 1))
                + 2 · COR + 6 · CAND
                + (34 + 8 · SDC) · C(Sel(eUQ))
                + eUQ · 3 · CAND.

The overall cost for the update queue is:

C(UpdQ) ≤ eUQ · C(UpdQ-Entry) + C(UpdQ-Control).
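The cost formulas can be turned into a small calculator. The unit costs and the cost functions for the find-first-one, parallel-prefix OR, and select circuits below are hypothetical placeholders (the thesis keeps CAND, CMUX, CREG, C(FFO), C(PP-OR), and C(Sel) symbolic); only the structure of the formula is taken from the text:

```python
import math

# Illustrative unit costs (hypothetical values).
C_AND = C_OR = 1
C_MUX = 3
C_REG = 8

def c_ffo(n):    return 2 * n * math.ceil(math.log2(max(n, 2)))  # rough
def c_pp_or(n):  return n * math.ceil(math.log2(max(n, 2)))      # rough
def c_sel(n):    return n  # rough per-bit cost of an n-way select

def c_updq(e_uq, s_dc, c_m2w):
    """Overall update-queue cost following the formulas above."""
    entry = (c_m2w - 1) * (19 + 8 * s_dc) * (C_MUX + C_REG) + 2 * C_AND
    ctrl = (c_ffo(e_uq + 1) + c_pp_or(e_uq + 1)
            + 2 * C_OR + 6 * C_AND
            + (34 + 8 * s_dc) * c_sel(e_uq)
            + e_uq * 3 * C_AND)
    return e_uq * entry + ctrl
```

Such a calculator makes it easy to see how the queue cost scales linearly in the number of entries and in the cache-line width.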

[Figure 5.22: Read queue entry]

5.6 Read Queue

The read queue holds one entry for each load miss. The load misses wait in the read queue until the needed line is obtained from the main memory. When the main memory sends the required cache-line, the entry is updated and the load can be returned to the circuit Sh4L of the memory unit (see figure 5.1). The main memory access is started from the corresponding entry of the update queue. Note that due to same-line misses, multiple loads and up to one store can share the same entry in the update queue (see section 5.3.5).

Similar to the update queue, the read queue entries have a valid bit for every byte of the requested word. The bytes that are not needed by the load access and the bytes that are valid at the time the load is filled into the read queue due to forwarding are marked valid already during the hit computation (see figure 5.8). In order to validate the other bytes, the read queue snoops on the result bus of the main memory. If the result bus is delayed, the read queue must snoop on the last stage of the delayed result bus. Otherwise, a load which is filled into the read queue could miss the cache-line it requires: this happens if the required line was requested by a preceding access and is already in the stages of the delayed result bus at the time the second load is sent to the read queue. In order to improve the performance, the read queue could additionally snoop on the un-delayed result bus. This optimization is not discussed in detail.

The general design of the read queue is very similar to the update queue or the reservation stations. Figure 5.22 shows a read queue entry. The read queue has one Byte sub-circuit for each of the four bytes of a word. The registers con, Dpf, rdy and full have the same meaning as the corresponding registers of the update queue.

[Figure 5.23: Read queue control]

If the signals full and Out.rdy are active, the load in the entry can be returned to the memory unit. This is indicated by the signal req. The control of the read queue (see figure 5.23) is built analogously to the control of the reservation stations (see section 4.2.2). An entry can be filled if it is not full. A find-first-one circuit selects the oldest load (i.e., the load in the entry with the highest index) for which the request bit req is active. If the read queue is not stalled, the load can be returned to the circuit Sh4L of the memory unit. Therefore, the outputs select⋆ of the find-first-one circuit are AND-ed with the negated stall input to compute the signals rfq⋆, which clear the entry containing the load. The signals select⋆ can be used to compute the output ReadQ.Sh4L.⋆ of the read queue, which is sent to the memory unit. The full bit of the bus to the memory unit is set if at least one entry is ready (indicated by the negated zero output of the find-first-one circuit). If the number of entries of the read queue eRQ is one, the same simplifications as for the reservation station can be made to reduce the delay requirements for the signals rfq⋆.

5.6.1 Cost and Delay

Let eRQ be the number of read queue entries. The computation of the request signal req⋆ := full⋆ ∧ rdy⋆ can be pipelined as in the reservation station (see section 4.2.3). Then, the maximum number of entries of the read queue eRQ is bounded by:

δ ≥ { 3 · DAND                                   if eRQ = 1
    { max{DAND, D(FFO(eRQ))} + 2 · DAND + DMUX   if eRQ > 1.   (5.17)

The delay of the stall output stallOut is:

D(stallOut) ≤ D(PP-OR(eRQ)). (5.18)

The variable prq is 1 if a register needs to be added in the computation of the signal req⋆, otherwise 0. Thus:

prq = 0 if δ ≥ D(FFO(eRQ)) + 3 · DAND + DMUX ∨ eRQ = 1, else 1.
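The bound (5.17) caps the number of read queue entries for a given stage depth. A small sketch under the usual hypothetical unit-delay model (unit delays and a tree model for the find-first-one circuit are assumptions, not the thesis's parameters):

```python
import math

# Hypothetical unit delays and FFO delay model.
D_AND, D_OR, D_MUX = 1, 1, 2

def d_ffo(n):
    return math.ceil(math.log2(n)) * (D_AND + D_OR) if n > 1 else 0

def delta_bound(e_rq):
    """Minimum stage depth required by equation (5.17)."""
    if e_rq == 1:
        return 3 * D_AND
    return max(D_AND, d_ffo(e_rq)) + 2 * D_AND + D_MUX

def max_entries(delta, limit=64):
    """Largest e_RQ whose bound still fits into the stage depth delta."""
    best = 0
    for e in range(1, limit + 1):
        if delta >= delta_bound(e):
            best = e
    return best
```

With these unit delays, δ = 5 admits only a single read queue entry, while δ = 6 already allows two.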

Let cM2R be the minimum number of cycles between the last stage of the result bus containing the needed data and a read miss being returned to the memory unit. This number depends on the delay of the path from the result bus MM.⋆ to the full bits of the entries. The delay of this path for eRQ > 1 is:

D(MM.⋆ → Entry⋆.full′) ≤ D(EQ(33))            (Entry⋆.upd)
                       + DOR                  (Entry⋆.rdy)
                       + (1 + prq) · DAND     (req⋆)
                       + D(FFO(eRQ)) + DAND   (rfq⋆)
                       + DAND + DMUX.

If eRQ = 1 the delay of the path is:

D(MM.⋆ → Entry⋆.full′) ≤ D(EQ(33))   (Entry⋆.upd)
                       + DOR         (Entry⋆.rdy)
                       + DAND        (req⋆)
                       + DAND        (rfq⋆)
                       + DAND.

It holds:

cM2R ≤ ⌈(D(MM.⋆ → Entry⋆.full′) − DMUX) / (δ − DMUX)⌉.

The part of the delay of the outputs ReadQ.Sh4L.⋆ to the memory unit which does not fit into the cycle where the entries are updated is added to the delay of the circuit Sh4L. The additional delay depends on the delay of the signals rfq⋆. The delay of rfq⋆ is:

D(rfq⋆) ≤ max{max{D(stallIn), (1 − prq) · DAND + D(FFO(eRQ))},
              D(MM.⋆ → Entry⋆.full′) − (cM2R − 1) · (δ − DMUX)}.

Then the additional delay for the circuit Sh4L is:

D⁺(Sh4L) ≤ max{0, D(rfq⋆) + D(Sel(eRQ)) − δ}.

The delay of the stall input stallIn from the memory unit is bounded by:

δ ≥ D(stallIn) + { 2 · DAND          if eRQ = 1
                 { 2 · DAND + DMUX   if eRQ > 1.   (5.19)

The number of inputs of a read queue entry from the main memory is 33 + 8 · L. The number of outputs to the read queue control is 35. The cost of a read queue entry is approximated by:

C(ReadQ-Entry) ≤ 4 · (COR + 10 · CMUX + 9 · CREG) + C(EQ(33)) + COR
               + 2 · CAND + 4 · CMUX + (36 + lROB) · CREG
               + (cM2R − 1) · (34 + 4 · L) · (CMUX + CREG).

The cost of the read queue control is:

C(ReadQ-Control) ≤ C(FFO(eRQ)) + C(PP-OR(eRQ))
                 + 2 · eRQ · CAND + (67 + lROB) · C(Sel(eRQ)).

The overall cost for the read queue is:

C(ReadQ) ≤ eRQ · C(ReadQ-Entry) + C(ReadQ-Control).

[Figure 5.24: Arbiter circuit to the memory unit]

5.7 Stall Computation

Data can be returned from the data cache to the memory unit by the hit computation, the read queue, and the update queue. Additionally, instructions are sent directly from the circuit Sh4S to the output of the data cache in case of a misaligned access. A small arbiter circuit is needed to select between the busses from the four circuits. Loads are assumed to be the most performance critical. Therefore the read queue has the highest priority, followed by the hit computation and the update queue. Misaligned accesses have the lowest priority as they cause an interrupt anyway. The arbiter circuit can be seen as part of the circuit Sh4L and is added to its cost and delay. The circuit is depicted in figure 5.24. The find-first-one circuit selects the input with the highest priority that wants to send an access (indicated by the full bit). An input is stalled if the full bit of the input is active and the full bit of any input with higher priority (computed by a parallel-prefix OR) or the stall input of the circuit Sh4L is active.

The total delay of the circuit Sh4L (including the additional delay from the queues) is

D(Sh4L) ≤ max{D(FFO(4)), D(Sel(eRQ)) + D(ReadQ.rfq⋆) − δ} + D(Sel(4)) + 4 · DMUX.

The input stall signal of the hit computation HC is defined as

HC.stallIn = (Sh4L.full ∧ Sh4L.HC.stall) ∨ (ReadQ.full ∧ ReadQ.stallOut) ∨ (UpdQ.full ∧ UpdQ.stallOut).

The input stall signal for the circuit Sh4S is defined as

Sh4S.stallIn = (Dmal ∧ Sh4L.Sh4S.stall) ∨ (¬Dmal ∧ HC.stallOut).   (5.20)
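The two stall equations are plain AND/OR networks and can be transcribed directly. This Python sketch models behavior only (the shortened parameter names are an assumption); equation (5.20) is encoded as: misaligned accesses are stalled by the circuit Sh4L, aligned accesses by the hit computation:

```python
def hc_stall_in(sh4l_full, sh4l_hc_stall, rq_full, rq_stall_out,
                uq_full, uq_stall_out):
    """HC.stallIn: stalled by Sh4L, the read queue, or the update queue."""
    return ((sh4l_full and sh4l_hc_stall)
            or (rq_full and rq_stall_out)
            or (uq_full and uq_stall_out))

def sh4s_stall_in(dmal, sh4l_sh4s_stall, hc_stall_out):
    """Sh4S.stallIn, eq. (5.20): Dmal selects which stall source applies."""
    return (dmal and sh4l_sh4s_stall) or (not dmal and hc_stall_out)
```

For example, a misaligned access (Dmal active) is unaffected by a stalling hit computation, since it bypasses the circuit HC entirely.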

5.8 Cost and Delay

Up to one buffer circuit is inserted into each of the circuits Sh4S, Sh4L, and HC. Let the variables bSh4S, bSh4L, and bHC be one if a buffer circuit is inserted in the respective circuit, otherwise the variables are zero. Then the numbers of stages cSh4L, cSh4S, and cHC for the respective circuits are

cSh4L(bSh4L) = ⌈(D(Sh4L) + bSh4L · DMUX) / δ⌉,
cSh4S(bSh4S) = ⌈(D(Sh4S) + bSh4S · DMUX) / δ⌉,
cHC(bHC) = ⌈(D(HC) + bHC · DMUX) / δ⌉.
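The stage counts are simple ceilings. As a numerical illustration (the circuit delay of 14 gate delays and DMUX = 2 are hypothetical example values):

```python
import math

D_MUX = 2  # hypothetical unit delay

def stages(d_circuit, b, delta):
    """Stage count c(b) = ceil((D + b * D_MUX) / delta)."""
    return math.ceil((d_circuit + b * D_MUX) / delta)
```

With δ = 5, a circuit of delay 14 occupies 3 stages; inserting a buffer circuit (b = 1) pushes it to 4 stages, which is the cost paid for relaxing the stall-signal timing.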

Let n be the number of functional units and tL be the number of inputs of the first stage of the arbiter tree in the completion circuit (see equation (4.21) on page 68). Then the delay of the input stall signal Mem.stallIn from the completion phase to the functional unit is (see equation (4.22) on page 70)

D(Mem.stallIn) ≤ { DAND + D(FLO(min{tL, n})) + DMUX   if tL > 2
                 { DAND + DOR                         if tL = 2.

Then for the four stall outputs of the circuit Sh4L it holds:

D(Sh4L.ReadQ.stall(bSh4L)) ≤ max{D(Mem.stallIn),
                                 D(AND-Tree(cSh4L(bSh4L) + 1))} + DAND,

D(Sh4L.HC.stall(bSh4L)) ≤ max{D(Mem.stallIn),
                              D(AND-Tree(cSh4L(bSh4L) + 1))} + DAND + DOR,

D(Sh4L.UpdQ.stall(bSh4L)) ≤ max{D(Mem.stallIn), DOR + DAND,
                                D(AND-Tree(cSh4L(bSh4L) + 1))} + DAND + DOR,

D(Sh4L.Sh4S.stall(bSh4L)) ≤ max{D(Mem.stallIn), 2 · DOR + DAND,
                                D(AND-Tree(cSh4L(bSh4L) + 1))} + DAND + DOR.

The delay of the stall output of the hit computation is (see equations (5.15) and (5.18) for the delay of the stall outputs of the queues)

D(HC.stallOut(bHC)) ≤ max{max{D(UpdQ.stallOut), D(ReadQ.stallOut)} + DAND + 2 · DOR,
                          D(Sh4L.HC.stall) + DAND + DOR,
                          D(AND-Tree(cHC(bHC)))} + DAND.

Note that the delay of the path from the hit computation to the queues is only DMUX. Thus, inserting a buffer circuit at the output registers of the circuit HC to the queues does not increase the delay. These output registers for cache misses can be stalled independently from the output registers for cache hits to the circuit Sh4L (which are the input registers of the circuit Sh4L). Then the stall outputs of the queues must obey the following restriction:

δ ≥ max{D(UpdQ.stallOut), D(ReadQ.stallOut), DOR + DAND} + DAND + DOR,

which holds true if the restrictions for the queues (equations (5.13) and (5.17)) are fulfilled. Thus, by inserting a buffer circuit at the output registers to the queues, the delay of the stall output of the circuit HC can be reduced to:

D(HC.stallOut(bHC)) ≤ max{2 · DAND + 2 · DOR,
                          D(Sh4L.HC.stall) + DAND + DOR,
                          D(AND-Tree(cHC(bHC)))} + DAND.

Finally, for the stall output of the circuit Sh4S, respectively of the memory unit, it holds:

D(Sh4S.stallOut(bSh4S, bHC)) ≤ max{max{D(HC.stallOut(bHC)),
                                       D(Sh4L.Sh4S.stall)} + DMUX,
                                   D(AND-Tree(cSh4S(bSh4S)))} + DAND.

Let eRQ be the number of entries of the read queue and eRS1 be the number of entries of the memory reservation station. Due to the restrictions of the read queue (equation (5.19)), the update queue (equation (5.16)), and the reservation station (equation (4.15) on page 62) regarding the delay of the stall inputs, without any buffer circuit the following equations must hold:

δ ≥ D(Sh4L.ReadQ.stall(0)) + { 2 · DAND          if eRQ = 1
                             { 2 · DAND + DMUX   if eRQ > 1,   (5.21)

δ ≥ D(Sh4L.UpdQ.stall(0)) + 3 · DAND,   (5.22)

δ ≥ D(Sh4S.stallOut(0, 0)) + { 2 · DAND          if eRS1 = 1
                             { 3 · DAND + DMUX   if eRS1 > 1.   (5.23)

Note that the restriction for the update queue can be reduced, as the first AND-gate on the stall path of the update queue can be merged into the computation of the signal Sh4L.UpdQ.stall using the distributive law.

The bounds for δ are successively reduced by setting first bSh4L, then bHC, and finally bSh4S to one. The position of the buffer circuit in the circuit Sh4L depends on δ. For δ = 5 the buffer circuit is placed in the first stage, for δ > 5 it is placed in the second stage. If δ = 5 it must hold tL = 2 (see equation (4.21) on page 68) and therefore D(Mem.stallIn) = DAND + DOR. Hence, the stall input from the completion phase is uncritical. For the stall signal of the first stage of the circuit Sh4L it must hold:

δ ≥ max{3 · DOR, D(AND-Tree(cSh4L(1))) + DAND} + DOR + DAND,

which is assumed to hold for δ = 5. The equations (5.21) and (5.22) then hold by construction of the read queue and the update queue. The stall outputs to the circuits HC and Sh4S come directly out of buffer circuits, thus:

D(Sh4L.{HC, Sh4S}.stall) = DAND.

If δ > 5, for the stall input from the completion phase it can only be guaranteed that D(Mem.stallIn) ≤ δ − DAND. Only the first stage of the circuit Sh4L can generate a stall. Thus, if the buffer circuit is placed in the second stage, the stall input fulfills the requirements for the stall signals of stage two and above (see section 2.5.4). The delays of the stall outputs of the circuit Sh4L then are:

D(Sh4L.ReadQ.stall) = DAND + DAND,
D(Sh4L.HC.stall) = DAND + DOR + DAND,
D(Sh4L.UpdQ.stall) = max{DAND, DOR} + DOR + DAND,
D(Sh4L.Sh4S.stall) = max{DAND, 2 · DOR} + DOR + DAND.

Thus, the equations (5.21) and (5.22) hold for δ > 5. Buffer circuits are inserted into the circuits HC and Sh4S if the equation (5.23) does not hold for the new delay of the stall inputs to the circuits HC and Sh4S. First a buffer circuit is inserted into the first stage of the circuit HC. This replaces the equation (5.23) by:

δ ≥ max{2 · DAND + 2 · DOR, D(AND-Tree(cHC(1) + 3))} + DAND,   (5.24)

δ ≥ max{max{DAND, D(Sh4L.Sh4S.stall)} + DMUX,
        D(AND-Tree(cSh4S(0)))}
  + DAND + { 2 · DAND          if eRS1 = 1
           { 2 · DAND + DMUX   if eRS1 > 1.   (5.25)

Note that the number of cycles cHC needed for the hit computation is mainly determined by the delay of the cache core (see equation 5.6). Yet, the equation (5.24) is assumed to hold true for all δ ≥ 5 for a reasonably large data cache. If the equation (5.25) does not hold, a buffer circuit is inserted into the first stage of the circuit Sh4S. For δ ≥ 5 the first AND-gate of equation (5.20) for the computation of the stall input of the circuit Sh4S can be merged into the computation of the signal Sh4L.Sh4S.stall. Then the equation (5.25) can be replaced by:

δ ≥ max{{ D(Sh4L.Sh4S.stall) + DMUX   if δ = 5
        { D(Sh4L.Sh4S.stall) + DOR    if δ > 5,
        D(AND-Tree(cSh4S(1) + 2))} + DAND,   (5.26)

which is assumed to be fulfilled for all δ ≥ 5. The requirements for the stall input of the reservation station then hold true by construction of the reservation station.

The inputs to the circuit Sh4S are the full bit, the tag, the operands, a 16 bit immediate constant, and 8 bits of control signals (89 + lROB bits in total). The outputs are the full bit, the tag, the write data, the effective address, the bus ub⋆, and 8 bits of control signals (77 + lROB bits in total). The cost of the circuit Sh4S including pipelining is approximated by (see appendix D.4.1):

C(Sh4S) ≤ C(Add(32)) + 2 · C(Dec(2)) + C(HDec(2))
        + 104 · CMUX + 11 · COR + 10 · CAND
        + (cSh4S(bSh4S) − 1) · (83 + lROB) · CREG.

The inputs to the circuit Sh4L are the full bit, a tag, 32 data bits, and 8 control bits from the read queue as well as from the hit computation, and a full bit, a tag, an interrupt signal, and the effective address from Sh4S and the update queue (148 + 4 · lROB bits in total). The outputs are the full bit, the tag, 64 data bits, and 2 interrupt signals (67 + lROB bits in total). Thus, the total cost of the circuit Sh4L including pipelining is (see appendix D.4.2):

C(Sh4L) ≤ C(Inc(2)) + C(Dec(2)) + C(Sel(4))
        + 186 · CMUX + 3 · COR + CAND
        + C(PP-OR(4)) + C(FFO(4)) + (34 + lROB) · C(Sel(4))
        + (cSh4L(bSh4L) − 1) · ⌈(215 + 5 · lROB)/2⌉ · CREG.

Let SDC be the width of the data cache-lines. The number of inputs of the hit computation is 77 + lROB. The outputs are the full bit, the tag, the effective address, the hit signal, the cache-line and byte-valid signals to the update queue, and the word with valid signals to the read queue (38 + lROB + 9 · SDC bits in total). The cost of the data cache is:

C(DCache) ≤ C(HC) + C(Core) + C(UpdQ(eUQ)) + C(ReadQ(eRQ))
          + cHC(bHC) · (105 + 2 · lROB + 9 · SDC) · CREG.

The overall cost of the memory unit is:

C(Mem) ≤ C(Sh4S) + C(DCache) + C(Sh4L).

Chapter 6

Instruction Fetch

This chapter presents the circuits which perform the instruction fetch. The instruction fetch mechanism and the branch prediction used by the DLXπ+ are presented in section 6.1. Section 6.2 describes the instruction fetch unit, which delivers a parallel instruction stream. The instruction fetch queue presented in section 6.3 serializes the stream and sends the instructions to the instruction register environment described in section 6.4. The branch checking unit presented in section 6.5 checks whether branch predictions made by the instruction fetch unit are correct and initiates a rollback in case of a misprediction. The flush of the processor needed for the rollback is detailed in section 6.6.

6.1 Instruction Fetch Mechanism

6.1.1 Overview

The instruction fetch loads the instructions to be executed from the main memory. If a branch instruction (i.e., a conditional branch or a jump) is fetched, the address of the next instruction to be fetched is not known. Since waiting for the core to compute the target of the branch instruction would take several cycles, the address of the next instruction is predicted. To ensure correctness, the branch checking unit (BCU) verifies the prediction and initiates a rollback in case it detects a misprediction.

The rollback of a mispredicted branch instruction is done in two steps. If the BCU detects a misprediction, the program counter (PC) pointing to the address of the next instruction is set to the correct branch target of the branch instruction. To prevent wrongly fetched instructions from initiating another rollback, all branch instructions following the mispredicted branch are invalidated. This is done by clearing the BCU, the reservation station of the BCU, and the decode stage. Since decode and dispatch of branch instructions are done in order, no instruction preceding the branch instruction is cleared.

The wrongly fetched instructions may have altered the producer tables. Thus, before new instructions may enter the decode phase, the producer tables must be restored. The restore can be done when the mispredicted branch instruction retires. Since retiring is done in order, all remaining instructions in the core have been wrongly fetched after the mispredicted branch and thus have to be invalidated. Hence, no valid instructions are in the core after clearing, and the producer table can simply be reset, i.e., all valid bits are set to one. As soon as the producer table is reset, new instructions fetched from the correct branch target address can enter the decode phase.

[Figure 6.1: Generation of the slow clock for m = 4]

6.1.2 Clocking of the Instruction Fetch

To be able to fetch one instruction per cycle, the instruction fetch must predict the address of the next instruction within one cycle. If the stage depth δ goes down to 5 gate delays, a nontrivial branch prediction is not possible. Therefore, the cycle time of the clock of the instruction fetch circuit (IFCLK) is an integer multiple m of the cycle time of the core clock (CLK). In each cycle of the slow clock IFCLK the instruction fetch delivers up to FS = 2^{fs} instructions, with FS ≥ m. The parallel instruction stream is loaded into an instruction fetch queue (IFQ), from which one instruction can be sent to the decode phase every CLK-cycle.

The divided clock IFCLK can be produced, e.g., by the circuit shown in figure 6.1. This circuit also provides the signal lastCycle, which is active during the last cycle of the fast clock before the rising edge of the slow clock. Note that the circuit in figure 6.1 does not generate a symmetric clock, as shown in the timing diagram. This is not problematic as edge-triggered registers are used.
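The clock division can be sketched as a cycle-level simulation. This is a behavioral model only, not the gate-level circuit of figure 6.1; the modulo-m counter is an assumption about one possible implementation:

```python
def divided_clock(m, cycles):
    """Simulate IFCLK generation for divider m over a number of fast
    CLK cycles. Returns per-cycle lastCycle flags (active in the last
    fast cycle before the rising IFCLK edge) and rising-edge flags of
    the slow clock IFCLK (one rising edge every m fast cycles)."""
    last_cycle, if_rise = [], []
    for t in range(cycles):
        last_cycle.append(t % m == m - 1)
        if_rise.append(t % m == 0)
    return last_cycle, if_rise
```

For m = 4 the signal lastCycle is active in every fourth fast cycle, one cycle before each rising IFCLK edge, which is exactly when the fetch results must be handed over to the IFQ.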

6.1.3 Branch Prediction

The instruction fetch mechanism is based on a design of an instruction fetch with branch prediction for super-scalar processors proposed in [Yeh93]. As required in the previous section, this mechanism delivers multiple instructions per cycle. The branch prediction is based on the division of the instruction stream into basic blocks. A basic block is a sequence of non-branch instructions followed by a single branch instruction. Since only the last instruction of a basic block is a branch, the instructions of a basic block are stored in consecutive memory cells. Therefore, if the start address of the basic block is known, the whole basic block can be fetched without any branch prediction. The start address of the next basic block is predicted in parallel to the fetching of the basic block. The instruction cache delivers up to one aligned block of FS = 2^fs words, respectively instructions, per cycle, called fetch block. A basic block may be distributed over multiple fetch blocks. In this case the fetching of the basic block has to be done in multiple cycles. If the current fetch block does not contain a branch instruction, the instruction fetch continues at the start of the next fetch block. The address used to fetch a fetch block is called fetch address. The branch prediction uses the fetch address to predict the fetch address for the next cycle and the number of instructions which belong to the current basic block. If the fetch block is the last fetch block of a basic block, this number identifies the address of the branch instruction of the basic block. It is also possible to identify the branches by the start address of their basic block. This scheme is called basic block addressing, in contrast to the fetch address scheme. More detailed information on basic block addressing can be found, e.g., in [Yeh93]. The prediction can be overruled by two events.
If the branch checking unit detects a misprediction, the instruction fetch continues at the correct target of the mispredicted branch. The prediction circuit is then updated with the corrected result. If an interrupt occurs, the instruction fetch continues at the start of the interrupt service routine. The prediction circuit presented in this thesis computes the prediction in one cycle. To implement more complex prediction circuits with a reasonable cycle time, the prediction must be done in multiple cycles. Then fetching usually continues by default at the beginning of the next fetch block. If the prediction circuit detects a taken conditional branch or a jump, the already fetched instructions are invalidated and fetching continues at the target of the branch instruction. Details on branch prediction schemes requiring multiple cycles are not treated in this thesis.

6.2 Instruction Fetch Unit

6.2.1 Overview

Figure 6.2 depicts an overview of the instruction fetch circuit. The circuit comprises the instruction cache ICache and the circuit NextPC which computes the PC for the next cycle using branch prediction. The circuit NextPC computes the fetch-PC fPC containing the current fetch address. The fetch-PC is used to access the cache. The outputs of the two circuits are sent to the instruction fetch queue (IFQ), which serializes the parallel instruction stream (see section 6.3). The instruction cache delivers the FS = 2^fs word wide fetch block, the page fault signal Ipf, and a hit signal. As long as the hit signal is zero, the other signals are not valid and the instruction fetch has to be stalled. The instruction fetch queue expects the valid instructions to be right aligned. Therefore, the data returned by the instruction cache is shifted to the right if the fetch address does not point to the beginning of an aligned fetch block. The first instruction in the fetch block which belongs to the current basic block is at word position ⟨fPC[fs+1 : 2]⟩. Thus, by shifting the fetch block by ⟨fPC[fs+1 : 2]⟩ words to the right, this first instruction is at the bit positions [31 : 0] of the shifter output. The output of the shifter is then divided into the 32 bit wide busses instri for i ∈ {0, ..., FS − 1}, where instr0 contains the first instruction in the fetch block

Figure 6.2: Instruction Fetch Unit

Figure 6.3: Instruction Shift for fs = 2 and ⟨fPC[fs+1 : 2]⟩ = 1

that belongs to the current basic block. See figure 6.3 for an example of this shift for fs = 2 and ⟨fPC[fs+1 : 2]⟩ = 1. All instructions returned by the cache belong to the same (aligned) fetch block, thus the high order bits [31 : fs+2] of the addresses of these instructions are identical. Let PCi[fs+1 : 2] denote the low order bits of the address of the instruction with index i. The value of PCi[fs+1 : 2] can be computed by adding the value of the bits fs+1 : 2 of the fetch-PC to i. Hence, the busses PC⋆[fs+1 : 2] can be computed by shifting the constant vector (FS − 1)_bin(fs), ..., (0)_bin(fs) by ⟨fPC[fs+1 : 2]⟩ · fs positions to the right. To invalidate the instructions following the predicted branch position in the fetch block, the bus mask computed by the circuit NextPC is used. It encodes the number of instructions in the fetch block which belong to the current basic block in addition to the instruction addressed by the fetch-PC. The bus mask uses half-unary encoding: if in total j instructions of the fetch block belong to the current basic block, the bits mask[j−1 : 0] are one and the bits mask[FS−1 : j] are zero. Note that the number of ones in the encoding is the number of valid instructions in the current fetch block, as bit 0 of half-unary encodings is always one. Hence, the signal mask[i] for i ∈ {0, ..., FS−1} can be used as valid signal for the instruction bus instri. Note that by shifting the output of the cache to the right, invalid data are shifted into the leftmost ⟨fPC[fs+1 : 2]⟩ instruction busses. Thus, the bus mask may at most encode the number FS − ⟨fPC[fs+1 : 2]⟩ − 1. The circuit NextPC is updated by the misprediction bus MP.⋆ from the branch checking unit (see section 6.5) and the interrupt bus JISR.⋆ computed by the retire sub-phase Ret2 (see section 4.5.4). The bus MP.⋆ contains information about all mispredicted branches, including instructions which have been wrongly predicted to be branches. Using the bus JISR.⋆, the instruction fetch is updated in case of an interrupt. Apart from the actual instructions, the instruction fetch unit delivers the addresses of the instructions, the predicted next fetch-PC, and the way of the branch target buffer (see section 6.2.3). This information is needed by the branch checking unit (see section 6.5) and for interrupt handling (see section 4.5.3) and is sent along with the instructions through the processor.

Figure 6.4: Instruction Cache
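The right shift of the fetch block and the half-unary mask described above can be sketched in software. The list/None representation of busses and invalid data is an assumption of this sketch; the hardware uses a shifter and per-bus valid signals.

```python
FS = 4   # fetch block size in instructions (FS = 2^fs)
fs = 2

def align_fetch_block(fetch_block, fpc_low, n_valid):
    """Right-align a fetch block and derive the per-instruction data.

    fetch_block -- FS words as returned by the cache, index 0 = lowest address
    fpc_low     -- <fPC[fs+1:2]>, word offset of the fetch address in the block
    n_valid     -- number of ones in the half-unary mask (valid instructions)

    Returns (instr busses with invalid positions as None, PCi[fs+1:2] values).
    """
    # shift by fpc_low words to the right; invalid data enters on the left
    instr = [fetch_block[fpc_low + i] if fpc_low + i < FS else None
             for i in range(FS)]
    # low order address bits: <fPC[fs+1:2]> + i (only defined for real words)
    pc_low = [fpc_low + i if fpc_low + i < FS else None for i in range(FS)]
    # half-unary mask: bits [n_valid-1:0] are one
    mask = [i < n_valid for i in range(FS)]
    return [ins if m else None for ins, m in zip(instr, mask)], pc_low

# Example matching figure 6.3: fs = 2, <fPC[fs+1:2]> = 1, two valid instructions
insts, pcs = align_fetch_block(["i0", "i1", "i2", "i3"], fpc_low=1, n_valid=2)
```

In the example, instr0 carries the word at offset 1 and only the first two busses stay valid, mirroring the shifted-in invalid data on the left.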

6.2.2 Instruction Cache

The instruction cache is a simple blocking cache with S_IC = 2^sIC bytes per line (S_IC ≥ 4 · FS), L_IC = 2^lIC lines, and K_IC = 2^kIC ways, similar to the cache core of the data cache. Since instruction fetch is done in order, a non-blocking cache has no advantages. The instruction cache is only updated by the main memory. Forwarding can be done by a forwarding circuit with stalling. Figure 6.4 depicts the instruction cache. The cache core circuit IC-Core contains the cache RAM including the replacement circuit. The cache core is assumed to be not sectored; the modifications for a sectored cache core are simple, but not treated in this thesis. The page fault signal Ipf is computed by the main memory. It cannot be active if the data is already located in the cache core. The sub-circuit Test of the forwarding circuit must be adapted if S_IC > FS. Then the requested fetch block is selected from the cache-line returned by the main memory. The same has to be done to the line returned by the cache core (not shown in the figure). Note that the memory might return data from a fetch request that was started before the instruction fetch has been cleared due to a misprediction or an interrupt. Thus, the returned data may not be valid for the fetch. Hence, it is necessary that the address on the result bus of the main memory is checked for each stage of the forwarding circuit. If neither the core nor the forwarding circuit return valid data (ohit ∨ forw = 0) and the last stage of the forwarding pipeline contains a valid entry, the access is treated as a miss. In this case the stage (and the preceding stages) are stalled and a memory request is started. When the memory returns the data, forw gets active and the valid data is sent to the output.

Figure 6.5: Instruction Cache Control

Figure 6.5 shows the computation of the request signal to the main memory. A new request to the main memory is started if a miss is in the last stage (full_cIF−1 ∧ ¬hit) and no request is pending (reqp = 0). As soon as the main memory does not stall, i.e., Mem.Ack = 1, the request is accepted and the register reqp is set. The register reqp stays set until the request returns and hit gets active. The address for the main memory request can be taken from the last stage of the forwarding circuit.
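The request handshake of figure 6.5 can be sketched as a one-cycle transition function. The function name and the tuple return are assumptions of this sketch; the signal names follow the text.

```python
def icache_request_ctrl(full_last, hit, reqp, mem_ack):
    """One CLK cycle of the instruction cache request control (a sketch).

    full_last -- last forwarding stage holds a valid entry
    hit       -- core or forwarding circuit returned valid data
    reqp      -- pending-request register at the start of the cycle
    mem_ack   -- main memory accepts the request this cycle

    A request is raised for a miss in the last stage while no request is
    pending.  reqp is set once the memory accepts and cleared when the
    data returns, i.e., when hit becomes active.
    Returns (req, next_reqp).
    """
    req = full_last and not hit and not reqp
    next_reqp = (reqp or (req and mem_ack)) and not hit
    return req, next_reqp
```

For example, a miss with an immediately accepting memory raises the request and sets reqp; once the data returns (hit active), reqp is cleared again.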

Cost and Delay

Let L_IC = 2^lIC be the number of lines, S_IC = 2^sIC the number of bytes of a line, and K_IC = 2^kIC the number of ways of the instruction cache. Let c_IF be the number of cycles of the instruction fetch. If an LRU algorithm is used for replacement, the cost and delay of the instruction cache can be computed similar to the cache core of the data cache. The delay and cost of such a cache are [MP00]:

D(IC-Core) ≤ max{D(RAM(L_IC, 8 · S_IC, 1, 1)),
              D(RAM(L_IC, 33 − l_IC − s_IC, 1, 1)) + D(EQ(33 − l_IC − s_IC))}
            + D(Sel(K_IC)),

C(IC-Core-Replace) ≤ C(RAM(L_IC, K_IC · k_IC, 1, 1, c_IF))
                   + K_IC · (C(EQ(k_IC)) + C_AND) + C(PP-OR(K_IC))
                   + K_IC · k_IC · 5 · C_MUX + C(Enc(K_IC))
                   + ⌈(34 − l_IC + 2 · k_IC)/2⌉ · c_IF · C_REG,

C(IC-Core) ≤ K_IC · (C(RAM(L_IC, 8 · S_IC, 1, 1, c_IF))
                   + C(RAM(L_IC, 33 − l_IC − s_IC, 1, 1, c_IF))
                   + C(EQ(33 − l_IC − s_IC)))
           + 8 · S_IC · C(Sel(K_IC)) + C(OR-Tree(K_IC))
           + C(Dec(k_IC)) + K_IC · C_AND
           + { 0                    if K_IC = 1
             { C(IC-Core-Replace)   if K_IC > 1.

Figure 6.6: Next PC Circuit

Let FS be the number of instructions per fetch block and let c_FS(32 − s_IC) be the number of cycles needed for forwarding with stalling for 32 − s_IC address bits. For the overall delay and cost of the instruction cache it holds:

D(ICache) ≤ D(IC-Core) + max{2 · D_AND, D_MUX} + D_AND,

C(ICache) ≤ C(IC-Core)
          + C(ForwardStall(32 − s_IC, 8 · S_IC + 1, c_IF, c_FS(32 − s_IC)))
          + D_OR + (c_IF + 1) · 32 · FS · C(Sel(4 · S_IC/FS))
          + 5 · C_AND + 2 · C_OR + C_REG.

6.2.3 Computation of the Next Fetch-PC

Overview

The circuit NextPC is divided into two parts. The first part computes the current fetch-PC fPC based on the predicted fetch-PC from the last cycle (pPC) and the input busses MP.⋆ and JISR.⋆. The second part computes the fetch-PC for the next cycle and the number of valid instructions in the current fetch block using branch prediction. The branch prediction is based on the value of fPC computed in the first part. Figure 6.6 shows the circuit NextPC. The fetch-PC fPC is computed in the sub-circuit UpdPC. If an interrupt has occurred (indicated by JISR.full = 1), fPC is set to the start of the interrupt service routine JISR.sisr. Otherwise, if a misprediction has been detected by the BCU (indicated by MP.full), fPC is set to the corrected branch target MP.cPC. If neither an interrupt has occurred nor a misprediction has been detected, the address predicted in the last cycle, which is saved in the register pPC, is used. Thus, the circuit UpdPC computes:

fPC = { JISR.sisr   if JISR.full
      { MP.cPC      if ¬JISR.full ∧ MP.full
      { pPC         if ¬JISR.full ∧ ¬MP.full.

The fetch-PC is used in the circuit Pred to predict the next fetch-PC nPC and the bus mask which encodes the number of valid instructions in the current fetch block. If the prediction circuit does not produce a valid prediction (indicated by Pred.hit = 0), default values are used for nPC and mask. The default values are based on the assumption that the fetch block does not contain any branch instructions following the address fPC. Accordingly, nPC is set to the beginning of the next cache-line and all instructions following the fetch address are assumed to be valid. The default value for the next fetch-PC is:

default.nPC[31 : 0] = (⟨fPC[31 : fs+2], 0^(fs+2)⟩ + 2^(fs+2))_bin(32).

The bus mask must encode the number of valid instructions in the current fetch block in half-unary encoding (not counting the instruction addressed by the fetch-PC). The current fetch block contains FS − ⟨fPC[fs+1 : 2]⟩ − 1 instructions that succeed the instruction addressed by the fetch-PC. Thus, the default value for the bus mask can be computed as:

default.mask[FS−1 : 0] = (FS − ⟨fPC[fs+1 : 2]⟩ − 1)_hun(FS)
                       = (⟨¬fPC[fs+1 : 2]⟩ (mod FS))_hun(FS)
                       = (⟨¬fPC[fs+1 : 2]⟩)_hun(FS).

Hence, the default values can be computed using a half-decoder and an incrementer. Let FS = 2^fs be the size of the fetch block in words. The cost and delay of the circuit NextPC are:

D(fPC) ≤ D(UpdPC) ≤ 2 · D_MUX,
D(NextPC) ≤ D(fPC) + max{D(HDec(fs)), D(Pred), D(Inc(30 − fs))} + D_MUX,

C(UpdPC) ≤ 2 · 32 · C_MUX,
C(NextPC) ≤ C(HDec(fs)) + C(Inc(30 − fs)) + C(Pred) + C(UpdPC)
          + (32 + FS) · C_MUX + 32 · C_REG.
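The priority selection in UpdPC and the default prediction values can be sketched as plain functions. Modeling addresses as Python integers (byte addresses) and the function names are assumptions of this sketch; the selection priority and the default values follow the text above.

```python
def upd_pc(jisr_full, sisr, mp_full, cpc, ppc):
    """Priority selection of the fetch-PC in circuit UpdPC:
    interrupt target, else corrected branch target, else predicted PC."""
    if jisr_full:
        return sisr
    if mp_full:
        return cpc
    return ppc

def next_pc_defaults(fpc, FS=4, fs=2):
    """Default prediction used when the prediction circuit misses.

    nPC falls through to the start of the next aligned fetch block and
    all instructions behind the fetch address count as valid.
    Returns (default nPC, default mask as an integer bit vector).
    """
    block_bytes = FS * 4                 # 2^(fs+2) bytes per fetch block
    npc = (fpc // block_bytes + 1) * block_bytes
    word_off = (fpc // 4) % FS           # <fPC[fs+1:2]>
    n_extra = FS - word_off - 1          # instructions after the fetch-PC
    mask = (1 << (n_extra + 1)) - 1      # half-unary: n_extra + 1 low ones
    return npc, mask
```

For a fetch-PC at word offset 1 of a four-word block, the default mask has three ones (the addressed instruction plus the two behind it) and nPC points to the next block.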

Prediction Circuit

The circuit Pred does the actual branch prediction. In this thesis a simple branch target buffer (BTB) [LS84] is described. It is a cache which saves the last target address of jumps and taken branches. More sophisticated algorithms using a return stack [Web88] or combining multiple prediction schemes [McF93] lie beyond the scope of this thesis. For not-taken branches the fetching must continue at the next instruction. Thus, regarding the address of the next instruction, not-taken branches can be treated like non-branch instructions. To exploit this fact, the BTB does not contain entries for not-taken branches. The BTB predicts the target of the first branch instruction in the current fetch block which is predicted to be taken. Hence, the prediction circuit can correctly predict up to FS branch instructions in one cycle if the first FS − 1 instructions are not-taken branches. The branch prediction reads the BTB every cycle in order to check whether it contains information about the current fetch-PC. On a hit the BTB returns the predicted values for mask and nPC and sets the signal hit to one. Otherwise the signal hit is set to zero.

Figure 6.7: Prediction circuit

The update of the BTB is controlled by the branch checking unit. The simple BTB used here needs to be updated only in case of a misprediction. More complex prediction schemes also learn from correctly predicted branches; those schemes update the BTB for every branch instruction. Two different cases of mispredictions must be kept apart. If the result of a branch was predicted wrongly, the corrected result is saved in the branch target buffer. If the branch target buffer contains an entry for a non-branch instruction or for a branch that should be predicted to be not taken, the entry must be invalidated. Note that an entry for a non-branch instruction may occur due to self-modifying code.
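The predict/update behavior described above can be sketched as follows. The real BTB is a set-associative cache indexed by fetch-PC bits; the dictionary keyed by the full fetch-PC and the class name `SimpleBTB` are assumptions that keep the sketch short.

```python
class SimpleBTB:
    """Branch target buffer as a dictionary (a behavioral sketch).

    Entries store (mask, nPC) for taken branches only; not-taken
    branches and non-branch instructions have no entry."""

    def __init__(self):
        self.entries = {}

    def predict(self, fpc):
        """Return (hit, mask, nPC) for the current fetch-PC."""
        if fpc in self.entries:
            mask, npc = self.entries[fpc]
            return True, mask, npc
        return False, None, None

    def update(self, mp_fpc, mp_branch, mp_mask, mp_cpc):
        """Update on a misprediction reported by the BCU.

        mp_branch = True  -> save the corrected result,
        mp_branch = False -> invalidate the offending entry."""
        if mp_branch:
            self.entries[mp_fpc] = (mp_mask, mp_cpc)
        else:
            self.entries.pop(mp_fpc, None)

btb = SimpleBTB()
btb.update(0x100, True, 0b1, 0x200)   # learn a taken branch at fetch-PC 0x100
```

A later `update` with `mp_branch = False` for the same fetch-PC removes the entry, which models the invalidation case described in the text.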

The BTB is built as a K_BTB = 2^kBTB way cache with L_BTB = 2^lBTB lines of width 32 + FS (nPC and mask) and can be built similar to the instruction cache. Figure 6.7 shows the prediction circuit Pred containing the BTB. The memory bus used in the instruction cache is replaced by the bus MP.⋆. In contrast to the instruction cache, a miss does not produce a stall or a memory request. The BTB has to be updated if the branch checking unit returns a misprediction (MP.full is active). The signal MP.branch indicates if the predicted branch was indeed a branch. This signal is used to write the valid bit of the cache entry. The fields mask and nPC of the entry are updated with the values of the bus MP.mask and the correct branch target MP.cPC. The bus MP.fPC contains the fetch-PC of the mispredicted branch and is used to select the line and update the cache directory. The prediction circuit also returns the bus way. If hit is active, way addresses the BTB entry that caused the hit and that has to be updated in case of a misprediction. If hit is inactive, way indicates which entry has to be updated if a new entry is made. Since the updates of the BTB are started by the BCU, the way must be sent to the BCU along with the instruction. The BCU then returns the way on the bus MP.way, which determines which way is written on an update of the BTB. Pipelining of the circuit is not possible as the predicted PC nPC has to be computed in one cycle of the slow clock IFCLK. However, for performance reasons forwarding is used in order to instantly use the results of the BCU. To support updates and requests at the same time, all RAM blocks have separate read and write ports. If a basic block is distributed over multiple fetch blocks, the fetch-PC of the branch is always the address of the fetch block containing the branch. Hence, the probability that the last fs + 2 bits of the fetch-PC are all equal to zero is disproportionately high.

For this reason the bits [fs+3 : 4] are used to address the cache-lines instead of the usual bits [fs+1 : 2]. This usually results in a better prediction, as shown in [Yeh93].

If K_BTB = 2^kBTB and L_BTB = 2^lBTB are the number of ways and lines of the BTB and FS = 2^fs is the fetch size, the cost and delay of the core of the BTB can be estimated as:

D(BTB-Core) ≤ max{D(RAM(L_BTB, 32 + FS, 1, 1)),
               D(RAM(L_BTB, 33 − l_BTB, 1, 1)) + D(EQ(33 − l_BTB))}
             + D(Sel(K_BTB)),

C(BTB-Core-Replace) ≤ C(RAM(L_BTB, K_BTB · k_BTB, 1, 1, 1))
                    + K_BTB · (C(EQ(k_BTB)) + C_AND)
                    + C(PP-OR(K_BTB))
                    + K_BTB · k_BTB · 5 · C_MUX + C(Enc(K_BTB))
                    + ⌈(34 − l_BTB + 2 · k_BTB)/2⌉ · C_REG,

C(BTB-Core) ≤ K_BTB · (C(RAM(L_BTB, 32 + FS, 1, 1, 1))
                    + C(RAM(L_BTB, 33 − l_BTB, 1, 1, 1))
                    + C(EQ(33 − l_BTB)))
            + (32 + FS) · C(Sel(K_BTB)) + C(OR-Tree(K_BTB))
            + C(Dec(k_BTB)) + K_BTB · C_AND
            + { 0                     if K_BTB = 1
              { C(BTB-Core-Replace)   if K_BTB > 1.

The overall delay and cost of the prediction circuit are:

D(Pred) ≤ D(BTB-Core) + max{2 · D_AND, D_MUX},

C(Pred) ≤ C(BTB-Core) + C_OR
        + C(ForwardStall(32, 33 + FS + k_BTB, 1, c_FS(32))).

6.2.4 Instruction Fetch Control

The instruction fetch control (see figure 6.8) computes the stall signals for the instruction fetch unit. Note that the branch prediction done in the circuit NextPC must be done within one cycle. However, the access to the instruction cache can be pipelined, as the prediction does not depend on the result of the cache access. Thus, the instruction fetch unit can have multiple stages. Additionally, the instruction fetch control synchronizes between the fast clock CLK and the slower clock IFCLK. Let c_IF be the number of stages of the instruction fetch. The synchronization of the clocks is done via the full signal IF.full_cIF for the output registers IF.⋆ of the instruction fetch unit (see figure 6.2). The full signal IF.full_cIF is updated every cycle of the fast clock CLK. It is set to one if at the end of the last CLK cycle before the rising edge of IFCLK (indicated by lastcycle = 1) the instruction cache returns a hit and the preceding stage is full, i.e., IF.full_cIF−1 ∧ IC.hit = 1.

Figure 6.8: Instruction Fetch Control

The output registers of the instruction fetch unit are stalled as long as the full signal IF.full_cIF is active and the IFQ cannot accept new data (IFQ.stallOut = 1). Note that the full signal IF.full_cIF is clocked with the fast clock CLK. Thus, as soon as the stall output IFQ.stallOut of the IFQ becomes zero, the signal IF.full_cIF is reset in order to mark that the instructions in the output registers of the instruction fetch unit have been sent to the IFQ. Then the output registers can be updated with the next rising edge of the slow clock IFCLK. The last stage of IF can generate a stall if the instruction cache returns a miss. The stall computation is adapted as usual. The instruction fetch control also computes the interrupt signal Imal indicating a misaligned fetch. It can be computed by OR-ing the two least significant bits of the fetch address.

6.2.5 Cost and Delay

The new value for the register pPC has to be computed in one cycle. Thus, the delay of the circuit NextPC gives a lower bound for the cycle time m · τ of the instruction fetch clock IFCLK. It must hold:

m · τ − 5 ≥ D(NextPC).

The overall delay of the instruction fetch and the number of cycles c_IF needed for the instruction fetch are:

D(IF) ≤ max{D(fPC) + D(IC) + fs · D_MUX, D(NextPC)},

c_IF = ⌈D(IF)/(m · τ − 5)⌉.
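As a numeric illustration of the stage-count formula, consider hypothetical values (the numbers and the helper name are ours, not results from the thesis):

```python
from math import ceil

def fetch_stages(d_if, m, tau):
    """c_IF = ceil(D(IF) / (m * tau - 5)): number of instruction fetch
    stages for delay d_if, clock divider m, and core cycle time tau.
    The constant 5 is the register overhead per stage from the text."""
    return ceil(d_if / (m * tau - 5))

# Hypothetical example: D(IF) = 40 gate delays, core cycle tau = 10,
# divider m = 2, so each IFCLK stage offers 2*10 - 5 = 15 gate delays.
c_if = fetch_stages(40, 2, 10)   # 40/15 rounded up -> 3 stages
```

Note how a larger m reduces the stage count: the same delay with m = 4 would fit into two stages of 35 gate delays each.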

The width of the registers IF.⋆ is 67 + FS · 33 + k_BTB (full bit, two interrupts, fetch-PC, predicted PC, FS instructions including valid bit, and the way of the BTB). Thus, the cost of the instruction fetch is:

C(IF) ≤ C(IC) + C(NextPC) + (8 · S_IC − FS) · C_MUX
      + (67 + FS · 33 + k_BTB) · C_REG.

6.3 Instruction Fetch Queue

The instruction fetch queue (IFQ) serializes the parallel instruction stream delivered by the instruction fetch unit. It is a FIFO queue with FS entries that can be loaded in parallel.

Figure 6.9: Instruction fetch queue

Figure 6.10: IFQ entry i

New instructions are only filled into the IFQ if the queue is empty. This simplifies the design of the IFQ, as an instruction instri for i ∈ {0, ..., FS−1} is always filled into the entry number FS − 1 − i. Also, all valid instructions in the queue share the signals fPC[31 : fs+2], nPC, way, hit, Ipf, and Imal, since these signals are the same within a fetch block. Figure 6.9 depicts an overview of the instruction fetch queue. The signals fPC[31 : fs+2], nPC, way, hit, Ipf, and Imal are stored only once for all entries of the queue. They are updated whenever new instructions are loaded into the queue (i.e., the queue is empty). The instructions instr⋆ and the low order bits of the instruction addresses PC⋆[fs+1 : 2] are stored in the sub-circuit IFQ-Core, which is built similar to the other queues, e.g., the reservation stations. This circuit also computes the stall output IFQ.stallOut of the IFQ and the full bit which is sent to the instruction register environment. The address PC of the instruction sent to the decode phase can be obtained by concatenating the common high order bits fPC[31 : fs+2] and the low order bits PC[fs+1 : 2] of the instruction.

6.3.1 IFQ Entries

A single entry of the circuit IFQ-Core is shown in figure 6.10. New instructions and addresses are filled into the IFQ when the signal update is active. The instruction FS − 1 − i is then filled into entry i for i ∈ {0, ..., FS−1}. The instruction is valid if the signal IF.mask[FS − 1 − i] is active and the output of the instruction fetch unit is valid (indicated by IF.full). If the signal fill is active, the content of the entry advances into the next entry.

Since the IFQ is only loaded in parallel, the input In.full of entry 0 is constantly zero. The signal clear clears all entries of the IFQ, even if new entries are filled into the IFQ.

6.3.2 Control

The instruction fetch aligns all valid instructions within a fetch block by shifting them to the right (see section 6.2.1). The rightmost instruction instr0 is filled into entry FS − 1. Therefore, no invalid instruction is followed by a valid one inside the queue. Thus, it is not necessary to remove empty entries from the queue. Hence, if the instruction register produces a stall (IR.stallOut = 1), all entries of the queue are stalled. Thus, for each entry i ∈ {0, ..., FS−1} of the IFQ it holds:

IFQ.fill = ¬IR.stallOut.

Note that in contrast to the reservation stations, the fill signal directly depends on the stall input. This means that a new fetch block may be loaded into the queue while the fill signal is active, i.e., the instructions in the queue advance simultaneously. Thus, if the instruction register environment does not produce a stall while the IFQ is loaded, the instruction instr0 is directly sent to the instruction register. The instruction FS − 1 − i is then filled into entry i + 1. Since no full entry may follow an empty one in the IFQ, the whole queue is empty if the last entry (number FS − 1) is empty. New instructions are filled into the IFQ if the queue is empty. The instruction fetch must be stalled if the IFQ is not empty. Hence,

IFQ.empty = ¬full_FS−1,
IFQ.update = ¬full_FS−1,
IFQ.stallOut = full_FS−1.

The decode phase also needs the signal IFQ.pb which indicates whether an instruction was predicted to be a taken branch (see appendix C.2.1). This is the case if the hit signal of the prediction circuit was active and the instruction is the last valid instruction in the fetch block. Thus:

IFQ.pb = IFQ.hit ∧ Entry_FS−1.full ∧ ¬Entry_FS−2.full.
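The IFQ control signals above can be collected into one sketch. The boolean-list model of the full bits and the function name are assumptions; the equations follow sections 6.3.2 as reconstructed above.

```python
def ifq_control(full, ir_stall, hit):
    """Control signals of the instruction fetch queue (a sketch).

    full     -- list of FS full bits; full[FS-1] is the output entry
    ir_stall -- stall from the instruction register environment
    hit      -- hit signal of the prediction circuit, stored with the block

    Returns (fill, update, stall_out, pb).
    """
    FS = len(full)
    empty = not full[FS - 1]     # no full entry may follow an empty one
    fill = not ir_stall          # entries advance unless decode stalls
    update = empty               # load a new fetch block only when empty
    stall_out = not empty        # stall the fetch while the queue drains
    # predicted-taken marker: the output entry holds the last valid
    # instruction of the fetch block and the BTB produced a hit
    pb = hit and full[FS - 1] and not full[FS - 2]
    return fill, update, stall_out, pb
```

For instance, with only the output entry full and no decode stall, the queue advances, signals a stall to the fetch, and raises pb for a BTB hit.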

6.3.3 Cost and Delay

The critical path in the IFQ is the update of the full bits of the entries. This path bounds δ to be at least

δ ≥ 2 · D_AND + D_OR + D_MUX.     (6.1)

The cost of the IFQ entries and the IFQ control are:

C(IFQ-Entry) ≤ (65 + 2 · FS) · C_MUX + (33 + FS) · C_REG
             + 4 · C_AND + C_OR,

C(IFQ-Control) ≤ 2 · C_AND + C_OR.

Figure 6.11: Instruction Register Control

The total cost of the IFQ is:

C(IFQ) ≤ FS · C(IFQ-Entry) + C(IFQ-Control)
       + (67 − FS + k_BTB) · (C_REG + C_MUX).

This cost is added to the cost of the instruction fetch unit.

6.4 Instruction Register Environment

The decoding of instructions must be stopped if a branch misprediction has been detected. Instruction decoding may be resumed with the correct instructions when the mispredicted instruction has retired. This is done by the instruction register control (see figure 6.11). The register haltdec is set if a misprediction occurred (i.e., MP.full is active). If haltdec is active, the instruction register is stalled (see section 4.1.7). The register haltdec is reset when the mispredicted instruction leaves the second retire phase. This is indicated by the signal Ret2.mp. The cost of the instruction register control is added to the cost of the decode sub-phase D1:

C(D1) += C_REG + C_AND + C_OR.

6.5 Branch Checking Unit

The branch checking unit (BCU) computes the target of branch instructions and checks if the branch prediction has correctly predicted that target. In order to check the prediction, every instruction that has been predicted to be a branch instruction is sent to the BCU, even if the instruction is actually not a branch instruction (see section 4.1.3). In case of a misprediction, the BCU initiates a rollback and updates the branch target buffer. For jump-and-link and return-from-exception instructions the BCU also computes a result that has to be written into the register files. The BCU is divided into the two circuits BComp and BCheck. The circuit BComp computes the result and the branch target of a branch instruction. The circuit BCheck uses the outputs of BComp to check whether the target of the branch instruction has been predicted correctly.

Figure 6.12: Branch Computation

The circuit BComp is shown in figure 6.12. The circuit uses multiple control signals which can be computed from the opcode of the branch instruction. The computation of these signals is not assumed to be critical and is not discussed in detail. The left part of the circuit computes the signal btaken which indicates if a branch has to be taken. Using the signal btaken, the corrected PC cPC of the branch instruction is computed in the right part of the circuit. Two types of branches are supported: regular branches, which check if the two operands OP1 and OP2 are equal, and branch-zero instructions (indicated by the signal bz), which compare the first operand against zero. Regular branches can be divided into branch-equal and branch-not-equal instructions (indicated by bneq). The type of a branch-zero instruction is defined by the signals blz, bgz, and beqz, indicating a branch-less-than-zero, branch-greater-than-zero, and branch-equal-zero, respectively. Multiple of these signals may be active to indicate, e.g., a branch-greater-equal-zero. Note that branch on floating point condition code (BC1) instructions (see table A.7 in appendix A) use the special register FCC as first operand. Thus, the instruction BC1F can be implemented by setting beqz to one, the instruction BC1T by setting bgz to one. The correct PC of the instruction following the branch instruction cPC can have 4 different values:

• The signal fpb (computed in the decode phase) is active if the instruction is not a branch instruction but has been wrongly predicted to be a branch instruction by the branch prediction. The instruction fetch must then be restarted at the wrongly predicted branch, i.e., cPC must be set to the address PC of the instruction.

• For jump-register and return-from-exception instructions (indicated by jrrfe), the bus cPC must be set to the value of the second operand.

• For jump instructions (indicated by jump) and taken branches (indicated by btaken), an immediate constant is added to the address of the instruction to compute cPC.

• For not-taken branches, the bus cPC is the address of the branch instruction incremented by 4.
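The btaken computation and the four-way selection of cPC can be sketched as one function. Passing the decode signals as a dict and the function name `bcomp` are assumptions of this sketch; the case priorities follow figure 6.12 as described above.

```python
def bcomp(op1, op2, pc, imm, ctl):
    """Branch computation (a behavioral sketch of circuit BComp).

    op1, op2 -- 32-bit operands; pc -- instruction address; imm -- offset
    ctl      -- decode signals: bz, bneq, blz, bgz, beqz, jump, jrrfe, fpb
    Returns (btaken, cPC).
    """
    eq = (op1 == op2)
    neg = bool((op1 >> 31) & 1)        # OP1[31], sign bit
    zero = (op1 == 0)
    if ctl.get("bz"):                   # branch-zero: compare OP1 against 0
        btaken = ((bool(ctl.get("blz")) and neg) or
                  (bool(ctl.get("bgz")) and not neg and not zero) or
                  (bool(ctl.get("beqz")) and zero))
    else:                               # regular (not-)equal branches
        btaken = eq != bool(ctl.get("bneq"))
    if ctl.get("fpb"):                  # falsely predicted branch: refetch PC
        cpc = pc
    elif ctl.get("jrrfe"):              # jump-register / rfe: second operand
        cpc = op2
    elif ctl.get("jump") or btaken:     # jump or taken branch: PC + imm
        cpc = (pc + imm) & 0xFFFFFFFF
    else:                               # not taken: PC + 4
        cpc = (pc + 4) & 0xFFFFFFFF
    return btaken, cpc
```

For example, a beqz branch with OP1 = 0 is taken and targets PC + imm, while a failing bneq falls through to PC + 4.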

Figure 6.13: Branch Checking

For jump-and-link and return-from-exception instructions, the BCU computes a result. For jump-and-link instructions this is the address PC of the instruction incremented by 4; for return-from-exception instructions it is the first operand.

The circuit BCheck computes the misprediction signal MP.full (see figure 6.13). If the instruction that caused the misprediction is a taken branch, the signal MP.branch is set. This signal is used to set the valid bit of the BTB, i.e., on a misprediction the signal MP.branch defines whether the entry that caused the misprediction is updated or removed. A misprediction has occurred in one of the following three cases:

• A branch instruction has wrongly been predicted not to be a branch. This is indicated by the signal npb computed in the decode phase.

• An instruction has been wrongly predicted to be a branch instruction (indicated by fpb).

• The fetch-PC of the next basic block was predicted incorrectly for a branch (i.e., nPC is not equal to cPC).

If a not-taken branch was not predicted to be a branch instruction, the IFU has correctly continued fetching at the following address. Thus, not-taken branches that have not been predicted to be branch instructions do not need to be handled as mispredictions. Therefore, the signal MP.full is set to zero in this case and no new entry is made in the BTB for the not-taken branch. In case of a misprediction, the signal MP.branch must not be active for instructions that have wrongly been predicted to be a branch and for not-taken branches. This removes the entry which produced the misprediction. The target of the branch instruction cPC is used to update the branch target buffer via the bus MP.cPC. The bus MP.mask indicating the position of the branch in the fetch block is computed by a half-decoder on the bits cPC[fs+1:2]. The signal MP.way indicating the way of the BTB that is updated is set to the value of the bus way that was computed during the prediction (see section 6.2.3).

[Figure 6.14 (Branch Checking Unit): the circuits BComp and BCheck receive RS.⋆ from the reservation station and produce BCU.StallOut; the bus BCheck.MP.⋆ and the signal IF.lastcycle feed the additional stage, which sends MP.⋆ and MP.full to the IF and exchanges BCU.⋆ and BCU.StallIn with the completion phase.]

Figure 6.14: Branch Checking Unit

6.5.1 Stall Computation

If the BCU detects a misprediction, it must update the instruction fetch unit. The IFU can only accept new data if the signal lastcycle is active. In order not to affect the completion of correctly predicted branches, an extra stage is added in which mispredicted branches wait until the signal lastcycle gets active (see figure 6.14). If this stage contains a mispredicted branch, the BCU is cleared. A branch instruction proceeds to the additional stage at the same time it enters the completion stage. If the branch instruction cannot proceed to the completion phase, the additional stage must not be filled. Otherwise, if the branch instruction was mispredicted, the BCU would be cleared and the branch instruction would not complete, but its completion is needed for the rollback. This additional stage does not have to stall the other stages of the BCU, as all stages of the BCU are cleared anyway if a misprediction is in the additional stage (indicated by the signal BCU.mp). Thus, the other stages of the BCU are only stalled by the signal BCU.stallIn received from the completion phase. The additional stage is stalled by

MP.full ∧ ¬IF.lastcycle.

Note that once a misprediction is in the additional stage, the stage is stalled until the instruction fetch unit can be updated. Thus, the mispredicted branch in the additional stage will not be overwritten by succeeding instructions.
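This hold behavior can be sketched as a small state update, assuming, per the text, that the stage keeps its misprediction while the IFU cannot yet be updated. The signal names follow the text; the encoding as Python booleans is ours.

```python
def additional_stage_next(mp_full, lastcycle, incoming_mp):
    """Next content of the additional stage of the BCU.

    A waiting misprediction (mp_full) is held while the IFU cannot
    accept the update (lastcycle inactive); once lastcycle is active
    the misprediction is handed to the IFU and the stage may take the
    misprediction flag of the next incoming instruction.
    """
    stall = mp_full and not lastcycle
    if stall:
        return True          # keep the current misprediction
    return incoming_mp       # stage is free for the next instruction

# A waiting misprediction survives until lastcycle becomes active:
print(additional_stage_next(True, False, False))   # held
print(additional_stage_next(True, True, False))    # handed off, refilled
```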

6.5.2 Cost and Delay

The delay of the BCU is:

D(btaken) ≤ max{D(EQ(32)) + DXOR, D(Zero(32)) + 2 · DAND + DOR} + DMUX,

D(BCU) ≤ max{D(Add(32)), D(Inc(32)) + DMUX, D(btaken) + DAND + DOR} + DMUX + D(EQ(32)) + DOR + DAND.

Let the variable bBCU be one if a buffer circuit is added after the input registers of the BCU, otherwise bBCU is zero. The number of cycles cBCU needed for the BCU is:

cBCU(bBCU) = ⌈(D(BCU) + bBCU · DMUX)/δ⌉.
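This computation is a simple ceiling division; a sketch with illustrative placeholder values for D(BCU) and DMUX (not the actual gate delays of the thesis' model):

```python
import math

def c_bcu(d_bcu, delta, b_bcu, d_mux=2):
    """Number of cycles needed for the BCU: the combinational delay
    (plus one mux delay if a buffer circuit is inserted, b_bcu = 1)
    divided by the stage depth delta, rounded up."""
    return math.ceil((d_bcu + b_bcu * d_mux) / delta)

# Illustrative values only: D(BCU) = 23 gate delays, stage depth 5.
print(c_bcu(23, 5, 0))  # without a buffer circuit
print(c_bcu(23, 5, 1))  # with a buffer circuit (one extra mux delay)
```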

Let BCU.stallIn be the stall input of the BCU from the completion phase. If bBCU = 0, the delay of the output stall signal stallOut of the BCU is:

D(stallOut) ≤ max{D(AND-Tree(cBCU(0) + 1)), D(BCU.stallIn)} + DAND.

Let eRS7 be the number of entries of the reservation station of the BCU. The reservation station then requires (see equation (4.15) on page 62):

δ ≥ D(BCU.stallOut) +
    2 · DAND           if eRS7 = 1,
    3 · DAND + DMUX    if eRS7 > 1.

If this equation cannot be fulfilled, bBCU must be one. This reduces the bound to:

δ ≥ max{D(AND-Tree(cBCU(1) + 2)), D(BCU.stallIn)} + DAND

which holds true for δ ≥ 5. The requirements for the input stall signal of the reservation station are then fulfilled by construction of the reservation station. The number of inputs of the BCU is 161 + kBTB + lROB, the number of outputs is 97 + fs + kBTB + lROB. The cost of the BCU is approximated by:

C(BCU) ≤ C(Add(32)) + C(Inc(32)) + 2 · C(EQ(32)) + C(Zero(32)) + 129 · CMUX + 9 · CAND + 7 · COR + CXOR + (32 + fs + kBTB) · CREG + 2 · CAND + (cBCU − 1) · ⌈(258 + fs + 2 · kBTB + 2 · lROB)/2⌉ · CREG.

6.6 Processor Flush

The processor must be flushed if an interrupt occurred or a misprediction has been detected. Interrupts are detected during the retire sub-phase Ret2, mispredictions are detected by the BCU. After a flush the processor must be in the state it would have after the execution of the instruction that caused the flushing (respectively before the execution, depending on the type of the interrupt or misprediction) if all instructions were executed sequentially. This is done when the instruction enters the retire sub-phase Ret3, since then all registers have the correct value and the producer tables can be reset. In order to reduce the penalty of mispredictions, the instruction fetch is restarted at the correct address as soon as the misprediction is detected by the BCU. To guarantee that no wrongly fetched instruction causes another misprediction, the flush of the processor is done in two steps. As soon as a misprediction is detected, all succeeding branches are flushed. This is done by clearing the instruction fetch unit, the instruction fetch queue, the decode phase, and the BCU including the corresponding reservation station. Since these parts of the processor execute instructions in-order, no preceding instruction is cleared. After the clear the instruction fetch is resumed at the correct address, but the instruction succeeding the mispredicted instruction is stalled at the instruction register (see section 6.4). When the mispredicted instruction enters the retire sub-phase Ret3, the remaining parts of the processor are flushed and the succeeding instruction can be decoded. Interrupts are detected at the end of the retire sub-phase Ret2. Therefore, the flush of the processor is not done in two steps. When an instruction that caused an interrupt enters the retire sub-phase Ret3, all parts of the processor are cleared.
Note that in parallel to the first part of the flush due to a misprediction or the flush due to an interrupt, the instruction fetch unit must be updated. This can only be done if the signal lastcycle is active. The flushing must not affect the additional stage of the BCU and the first stage of the retire sub-phase Ret3, since these stages contain the instructions that caused the flush. Note that interrupts have a higher priority than mispredictions. Thus, if a mispredicted instruction is in the additional stage of the BCU at the same time an instruction that caused an interrupt is in the first stage of the sub-phase Ret3, the misprediction has no effect. Also, the remaining stages of the sub-phase Ret3 must not be flushed as they contain instructions preceding the instruction that caused the flush. However, the writes to the producer table during Ret3 are overwritten by the clear signal of the producer table. It follows:

IF.clear = IFQ.clear = D1.clear = D2.clear = RS6.clear = FU6.clear = (JISR.full ∨ MP.full) ∧ lastcycle,

RS0...5.clear = FU0...5.clear = Complete.clear = Ret1.clear = Ret2.clear = ROB.clear = PT.clear = (JISR.full ∧ lastcycle) ∨ Ret2.mp.

The clear signals are assumed to have a delay of zero. This is not a problem for the signals JISR.full, Ret2.mp, and lastcycle as they are not assumed to be critical. The signal MP.full is the critical signal of the BCU, but the OR- and AND-gate for the computation of the clear signal can be put in the additional stage for mispredicted branches and therefore have no impact on the delay of the BCU.
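The two clear equations can be checked with a small boolean sketch; the signal names follow the text, the encoding as Python booleans is ours.

```python
def front_end_clear(jisr_full, mp_full, lastcycle):
    """Clear signal for IF, IFQ, D1, D2, RS6, and FU6: an interrupt or
    a misprediction clears the in-order front end, but only once the
    IFU can accept the new fetch address (lastcycle active)."""
    return (jisr_full or mp_full) and lastcycle

def back_end_clear(jisr_full, lastcycle, ret2_mp):
    """Clear signal for RS0-5, FU0-5, Complete, Ret1, Ret2, ROB, and
    PT: only an interrupt, or a misprediction reaching the retire
    sub-phase Ret3 (signalled by Ret2.mp), flushes these parts."""
    return (jisr_full and lastcycle) or ret2_mp

# A misprediction alone clears the front end but not the back end:
print(front_end_clear(False, True, True))   # True
print(back_end_clear(False, True, False))   # False
```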

Chapter 7

Discussion

This chapter discusses the results of the preceding chapters. In section 7.1 the reduction of the stage depth below five gate delays is investigated. Section 7.2 discusses the advantages and disadvantages of the gate model used in this thesis. The cost and delay values of the DLXπ+ that can be derived from the formulas in the preceding chapters are presented in section 7.3 and compared against the Tomasulo DLX from [Kro99] on which the DLXπ+ is based. Related work is discussed in section 7.4.

7.1 Stage depths below 5

In this thesis all circuits of the DLXπ+ were designed to allow a stage depth δ ≥ 5 for the given gate model. This section describes possible enhancements in order to allow a stage depth of 4. All basic bounds for δ regarding forwarding and stalling hold for δ = 4. If a stage can generate a stall, the stall input may have at most the following delay (see equation (2.10) on page 18):

D(stallIn) ≤ δ − (DOR + DAND).

If a stall signal is used in a forwarding circuit with stalling it must hold:

D(stall) ≤ δ − DMUX .

Since the delay of stall signals can be reduced to DAND by inserting buffer circuits these equations can be satisfied with δ = 4. The forwarding circuits require (see equation (2.12) on page 23)

δ ≥ max{DMUX, DOR} + DMUX,

which is fulfilled for δ = 4. Equation (4.14) on page 59 bounds δ for a reservation station of type i as follows:

δ ≥
    3 · DAND                                     if eRS = 1,
    max{DAND, D(FFO(eRS))} + 2 · DAND + DMUX     if eRS > 1 ∧ i ∉ {0, 6},
    max{DAND, D(FFO(eRS))} + 3 · DAND + DMUX     if eRS > 1 ∧ i ∈ {0, 6}.

This equation can be fulfilled for δ = 4 only if eRS = 1, i.e., all reservation stations may have at most one entry. The number of functional units ni of a type i for a given δ is bounded by equation (4.2) on page 43:

δ ≥ DMUX + max{DOR, max over 0 ≤ i ≤ 6 of (D(PP-OR(eRSi)) + D(FFO(ni)))} +
    0       if ni = 1,
    DAND    if ni > 1.

Hence, for δ = 4 the number of functional units of each type may be at most 2. The issue circuit also requires that in one cycle an instruction can be issued to at least two groups (see equations (4.3) and (4.4) on page 44):

δ ≥ DAND + D(Sel(2)) + DAND,

δ ≥ max{2 · DAND, DAND + DMUX, DMUX} + D(Sel(2)).

Both equations hold true for δ ≥ 4. The requirements for the remaining stall signals of the decode phase can be reduced by adding additional buffer circuits. Note that this would mean adding a buffer circuit into a RAM block (see figure 2.7), which is not necessary for δ ≥ 5. The completion phase can be built for δ = 4 if a tree of two-port arbiters is used. These arbiters reduce the requirements of the completion phase to the bound given by equation (4.20) on page 68:

δ ≥ 2 · DAND + 2 · DOR

which also holds true for δ = 4. The retire phase has no additional requirements on the stage depth δ. However, the ROB environment used in the retire phase introduces multiple bounds on δ. The delay of the output forwOut of the forwarding circuits for the retire context of the ROB access is bounded by (see equation (4.24) on page 83)

D(forwOut) ≤ δ − (DMUX + DOR + DAND).

In the standard implementation of the forwarding circuit the output forwOut has at least delay DOR and thus the equation does not hold for δ = 4. In order to fulfill the bound, the signal forwOut must be derived from the register forw^(cRet1−1) of the last stage of the forwarding circuit (see figure 4.27 on page 82) and not from the signal forwUpd^(cRet1−1), which has a delay of at least DOR. This does not affect the correctness but increases the number of cycles cC2R and cA2R needed to forward from the write access in the completion- respectively allocation-context to the read access in the retiring-context by one. The delay optimizations for the ROB control signals headce and tailce in section 4.6.7 also apply for δ = 4. They only assume that equation (4.24) holds true and that the delay of the signal D1.stallIn0 ∨ D1.genStall0 is at most δ, which can be achieved by inserting buffer circuits in the decode phase (see above). The counters for the full and empty bits require (see equation (4.27) on page 86)

δ ≥ 2 · DMUX + DAND

due to the loop for computing the next counter value. This does not hold for δ = 4. Therefore, the counters for the full and empty bits must be modified as depicted in figure 7.1.

[Figure 7.1 (Optimized full and empty counter for δ = 4): built from constant left- and right-shifters LS(1) and RS(1), 4:1 selection circuits controlled by clear and by the control signals headce and tailce and their delayed versions headced and tailced, decoders Dec, and registers ecnt and fcnt; inputs inROB and Ret3.full0; outputs ROB.empty and ROB.full.]

Figure 7.1: Optimized full and empty counter for δ = 4

Both counters delay one of the control signals for incrementing and decrementing (tailce respectively headce) by one cycle and use the delayed signal to modify the output of the counter (similar to the counter for the full bit in figure 4.31 on page 91). Due to the delaying of the signals tailce and headce, all control signals for the counter come directly out of a register. This allows the use of a 4-input selection circuit instead of two variable shifters to compute the next value of the counter. Based on the values of the increment respectively decrement signals, the selection circuit selects between the shifted and the un-shifted counter value. If only one of the increment or decrement signals is active, the left- respectively right-shifted value is selected. If none or both signals are active, the un-shifted counter value is selected, since the left- and the right-shift cancel each other. Note that the constant shifters in figure 7.1 only consist of wiring and therefore have no delay. Thus, the modified counter only bounds δ by:

δ ≥ max{D(Dec(2)), DAND} + D(Sel(4))

which holds true for δ = 4. The update queue of the data cache bounds δ as follows (see equation (5.12) on page 135):

δ ≥ max{D(OR-Tree(eUQ)),DAND} + 3 · DAND.

Thus, for δ = 4 the number of entries of the update queue eUQ must be one. If the optimized completion for stores is used, the path from the stall input to the full bit of the entry bounds the stage depth by (see equation (5.13) on page 137):

δ ≥ max{D(OR-Tree(eUQ)), 2 · DAND} + 3 · DAND.

[Figure 7.2 (Optimized IFQ entry): registers full, instr, and PC, each with a fill multiplexer selecting the inputs In.full, In.instr, and In.PC and an update multiplexer selecting IF.mask[FS − 1 − i], IF.instr(FS − 1 − i), and IF.PC(FS − 1 − i); the full register is additionally controlled by update and clear; outputs Out.full, Out.instr, and Out.PC.]

Figure 7.2: Optimized IFQ entry

Since this does not hold for δ = 4, the optimized completion cannot be used for δ = 4. The signals sent⋆ bound the stage depth δ by (see equation (5.14) on page 137)

δ ≥ D(FFO(eUQ + 1)) + DAND + DMUX

which holds true for eUQ = 1 and δ = 4. For the read queue it must hold (see equation (5.17) on page 141):

δ ≥
    3 · DAND                                     if eRQ = 1,
    max{DAND, D(FFO(eRQ))} + 2 · DAND + DMUX     if eRQ > 1.

Thus, the number of read queue entries eRQ must also be one for δ = 4. The instruction fetch queue requires that the stage depth fulfills the equation (6.1) on page 161

δ ≥ 2 · DAND + DOR + DMUX

which does not hold for δ = 4. In order to be able to build the IFQ for δ = 4, the following modification has to be made: the AND-ing of the full bit and the mask is done in the instruction fetch unit. This can be done without increasing the delay, but upon activation of the clear signal all mask bits have to be reset. Then the circuit depicted in figure 7.2 can be used for the entries of the instruction fetch queue. Note that in this optimized version the entries might be filled even if the clear signal is active. Thus, the clear signal for the IFQ must be active for at least two cycles.

Using the modifications described in this section it is possible to build the DLXπ+ with a stage depth δ of 4. However, equation (4.14) on page 59 bounds the number of reservation station entries for all reservation stations to one. Note that the stall output of a reservation station is active even if an instruction is currently dispatched, since otherwise the delay of the stall output would get too large. Thus, instructions can be issued to reservation stations with one entry only every other cycle. Thus, in order to be able to issue one instruction of a type in every cycle, the DLXπ+ needs two functional units of that type. Since the maximum number of functional units of a type is two due to equation (4.2) on page 43, at most two instructions of a type can wait for their operands. Since for such a deep pipeline it is very likely that the operands of an instruction are not valid at the time the instruction is issued, this is assumed to have a large performance impact. Also, the optimization of the update queue from figure 5.21 on page 137 cannot be done (see equation (5.16) on page 139). Thus, despite the optimized completion, stores can only be completed if the stall from the memory unit to the read queue of the data cache is inactive. For these reasons and the large increase in the number of cycles needed, it is very unlikely that the performance of the DLXπ+ regarding a reasonable benchmark can be improved by reducing the stage depth from 5 to 4. Therefore, in this thesis only stage depths δ ≥ 5 were discussed in detail.

This thesis does not provide a formal proof that it is impossible to build the DLXπ+ with a stage depth δ = 3 without sacrificing fundamental aspects of the DLXπ+, e.g., the out-of-order execution, the precise interrupts, or the possible best-case CPI (cycles-per-instruction) of 1. However, the author thinks that it is at least very unlikely that such a DLXπ+ is possible, since even after the described enhancements most equations bound the stage depth to be at least 4. In particular the bound given by the loop for the data of the forwarding circuit with stalling (see equation (2.12) on page 23), which bounds δ to

δ ≥ 2 · DMUX,

is assumed to be hard to improve.

7.2 Gate Model

The gate model used in this thesis is rather simple, since it does not take fanout or wire delays into consideration. Note that some of the paths that bound the stage depth δ have large fanout, e.g., the signal ROBhead.valid which stalls the whole ROB read access in the retiring context. Due to the reduction of the device sizes in integrated circuits, the delay of wires will have an increasing influence on the combinational delay of signals and therefore should not be neglected [HHM99]. This can also affect some of the critical paths (in particular stall signals) of the DLXπ+. Thus, the gate model used in this thesis could be too optimistic regarding delay. On the other hand, the gate model only allows two-input gates and two-port muxes. Thus, using the more complex gates available in real designs, like multi-input NANDs and NORs or wide transmission-gate muxes [WH04], the delay of the critical path could be reduced. This could make up for the too optimistic estimations of the gate model due to the neglected fanout and wire delays. Even if one assumes that the gate model is not very realistic, the degree of pipelining presented in this thesis is still very high in comparison to previous work. For example, in the deeply pipelined Pentium 4 processor a 16-bit addition (which has 12 combinational gate delays in our model) can be computed in less than half a cycle [HSU+01]. Thus, even if the used gate model is off by a factor of two, the amount of useful work that can be done in a cycle of the Pentium 4 is still more than three times the 4 gate delays of the DLXπ+ with minimum cycle time.

Variable          Description
τ = δ + 5         cycle time of the DLXπ+
LROB = 2^lROB     # lines of the reorder buffer
ni                # FUs of type i (0 ≤ i ≤ 6)
eRSi              # entries of the RSs of type i (0 ≤ i ≤ 6)
LDC = 2^lDC       # lines of the data cache
SDC = 2^sDC       # bytes of the cache-lines of the data cache
KDC = 2^kDC       associativity of the data cache
eUQ               # entries of the update queue
eRQ               # entries of the read queue
LIC = 2^lIC       # lines of the instruction cache
SIC = 2^sIC       # bytes of the cache-lines of the instr. cache
KIC = 2^kIC       associativity of the instruction cache
LBTB = 2^lBTB     # lines of the branch target buffer
KBTB = 2^kBTB     associativity of the branch target buffer
FS = 2^fs         # instructions fetched per IFCLK cycle

Table 7.1: Parameters of the DLXπ+

The author thinks that even for a more realistic model (e.g., the logical effort model [SSH99]) it would still be possible to allow a similarly extensive super-pipelining. In addition to the techniques discussed in this thesis it would probably be necessary to further decrease the delay of stall signals, e.g., by pipelining the full bits of an instruction one cycle ahead of the actual data and thus computing the stall signal one cycle ahead. Also the functional units (except the memory unit) would not need stall signals if the number of CDBs were equal to the number of functional units and instructions could not "collide" inside the functional units. Collisions could be prevented, e.g., by handling divisions in software and therefore resolving loops, or by adapting the dispatch logic.

7.3 Overall Cost and Delay

In this section the variables describing the pipelining of the DLXπ+ (e.g., the number of cycles needed for decoding) as functions of the parameters (e.g., cycle time or ROB size) are presented. The parameters of the DLXπ+ are listed in table 7.1. The variables describing the behavior of the DLXπ+ are listed in table 7.2. Using the formulas presented in this thesis it is straightforward to write a small program that computes the variables and the cost of the DLXπ+ as functions of the parameters.1 In appendix E the values of the variables for some combinations of the parameters are described.
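The structure of such a program is a direct transcription of the formulas; a minimal sketch for two of the values (the delay constant below is an illustrative placeholder, not a value derived from the actual formulas):

```python
import math

# Placeholder combinational delay (in gate delays), for illustration only.
D_BCU = 23

def variables(delta):
    """Compute a few of the variables for a given stage depth delta.

    tau = delta + 5 accounts for the 5 gate delays of the registers;
    each cycle count is the combinational delay of the corresponding
    circuit divided by delta, rounded up."""
    return {
        "tau": delta + 5,
        "cBCU": math.ceil(D_BCU / delta),
    }

for delta in (5, 10, 32):
    print(variables(delta))
```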

Table 7.3 compares the DLXπ+ with the out-of-order DLX presented in [Kro99], which is in the following called DLXK. In order to match the DLXK, the following parameters are chosen for the DLXπ+:

1The program can be found at http://www-wjp.cs.uni-sb.de/leute/private homepages/jochen/DLX+.tgz.

Variable   Description
m          multiplier for IFCLK
cIF        # IFCLK cycles of the instruction fetch
cD1        # cycles needed for the decode sub-phase D1
cD2        # cycles needed for the ROB access in the decode sub-phase D2
cI         # cycles needed for issuing
cT         # cycles (needed if cI > 1)
cU2DM      minimum # cycles from the update of an instruction in the memory RS to the dispatch of the instruction
cU2DI      minimum # cycles from the update of an instruction in an integer or BCU RS to the dispatch of the instruction
cU2DF      minimum # cycles from the update of an instruction in an FP RS to the dispatch of the instruction
cU2DB      minimum # cycles from the update of an instruction in a branch checking RS to the dispatch of the instruction
cAT        # cycles needed for the arbiter tree in the complete phase
cC         # cycles needed for the complete phase
cC2R       minimum # cycles from the completion to the retiring of an instruction
cRet2      # cycles needed for the retire sub-phase Ret2
cSh4S      # cycles needed for the shift for store circuit
cHC        # cycles needed for the hit computation
cM2H       # cycles the result bus must be delayed
cM2W       minimum # cycles between the returning of a memory request and the update of the cache core by a store in the update queue
cM2R       minimum # cycles between the returning of a memory request and the update of the cache core by a load in the update queue
cM2Q       minimum # cycles between the returning of a memory request and the returning of a load by the read queue
cSh4L      # cycles needed for the shift for load circuit
cIAlu      # cycles needed for the integer ALU
cIMul1     # cycles needed for the first part of the integer multiplicative unit
cIMul2     # cycles needed for the second part of the integer multiplicative unit
cIMul3     # cycles needed for the third part of the integer multiplicative unit
cFPAdd     # cycles needed for the FP additive unit
cFPMul1    # cycles needed for the first part of the FP multiplicative unit
cFPMul2    # cycles needed for the second part of the FP multiplicative unit
cFPMul3    # cycles needed for the third part of the FP multiplicative unit
cFPMul4    # cycles needed for the fourth part of the FP multiplicative unit
cFPMisc    # cycles needed for the FP miscellaneous unit
cBCU       # cycles needed for the branch checking unit

Table 7.2: Variables describing the behavior of the DLXπ+

• The ROB has 16 entries.

• Every functional unit type is instantiated only once. In the DLXK the reservation stations of the floating point units have two entries, all other reservation stations have four entries. The reservation stations of the DLXK can issue an instruction into a reservation entry in the same cycle an instruction is dispatched from the reservation station, even if the reservation station is full. This is not possible in the DLXπ+, see section 4.2.2. Therefore, the number of entries is increased by one for the DLXπ+.

• The DLXK uses a common 16KB direct-mapped cache for data and instructions. In order to match the size, the DLXπ+ uses two 8KB direct-mapped caches for the instruction fetch and the memory unit.

• The other parameters have no counterpart in the DLXK and are chosen as fol- lows: update queue and read queue have 4 entries, the BTB has 256 entries and is 4-way set associative. The instruction fetch unit fetches 8 instructions per IFCLK-cycle.

The cycle time of the DLXK is 106 [Kro99]. The cycle time τ of the DLXπ+ must be at least 12 (δ = 7) in order to allow five reservation station entries (see section 4.2.3). Thus, for cycle times τ ∈ {10, 11} the number of reservation station entries is set to at most 2 respectively 4. Note that in table 7.3 the columns for those values of the cycle time τ have been omitted which would have the same entries as the column for the next smaller τ. Only the variables m and cIF change their value for cycle times greater than 37. At a cycle time τ = 50 the multiplier m can be set to one, but then the instruction cache access must be pipelined, i.e., cIF becomes 2. If τ = 68 both values m and cIF can be one. Since the instruction fetch unit delivers up to 8 instructions per instruction fetch cycle, the instruction fetch is not assumed to be performance critical even if it takes two cycles. Therefore, in the following only cycle times of 37 and below are investigated for the DLXπ+. Note that the DLXK does not divide the decode phase into the sub-phases D1 and D2. Therefore, the number of cycles needed for issuing and for the ROB access in the decode sub-phase D2 are set to zero for the DLXK. For cycle times of 30 and above, the value cIALU of the DLXπ+ is zero. This indicates that the integer ALU does not contain any registers. Thus, the instruction can directly proceed from the reservation station through the integer ALU to the CDB register. In the DLXK all functional units must have a register, but the completion phase has no register, which has the same effect. In order to simplify the comparison the variables cIALU and cC of the DLXK are set to the same values as for the DLXπ+ with a cycle time of 30. The DLXπ+ with a cycle time of 37 needs in the best case 5 cycles to process an integer ALU instruction (instruction fetch not counted): 1 cycle decode, 1 cycle issue, 1 cycle dispatch + execution, 1 cycle completion, and 1 cycle retiring.
This is one cycle more than needed for the DLXK, which does decoding and issuing in the same cycle. However, the cycle time of the DLXK is roughly 3 times as large as the cycle time of this variant of the DLXπ+. Thus, even without super-pipelining the DLXπ+ is assumed to have a much better performance than the DLXK.
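This back-of-the-envelope comparison can be written out directly; the times are in gate delays, with the cycle counts and cycle times taken from the text:

```python
def best_case_time(cycles, tau):
    """Best-case time of an integer ALU instruction in gate delays
    (instruction fetch not counted): cycles times the cycle time."""
    return cycles * tau

dlx_pi = best_case_time(5, 37)    # DLXpi+ at cycle time 37
dlx_k = best_case_time(4, 106)    # DLXK at cycle time 106
print(dlx_pi, dlx_k)              # 185 424
print(round(dlx_k / dlx_pi, 2))   # the DLXpi+ is about 2.29 times faster
```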

The last column (separated by |) gives the values for the DLXK; all other columns refer to the DLXπ+.

τ      10 11 12 13 14 15 16 17 18 20 21 23 25 30 34 35 37 50 68 | 106
m       5  5  5  4  4  4  4  3  3  3  3  3  2  2  2  2  2  1  1 |   1
cIF     2  2  2  2  2  2  2  2  2  2  2  1  2  2  1  1  1  2  1 |   1
cD1     7  6  5  4  4  4  3  3  3  2  2  2  2  2  2  1  1  1  1 |   1
cD2     4  4  3  3  3  2  2  2  2  2  2  2  1  1  1  1  1  1  1 |   0
cI      2  2  2  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1 |   0
cU2DM   3  4  3  3  2  2  2  2  2  2  1  1  1  1  1  1  1  1  1 |   1
cU2DI   4  4  3  3  2  2  2  2  2  1  1  1  1  1  1  1  1  1  1 |   1
cU2DF   5  4  3  3  2  2  2  2  2  2  1  1  1  1  1  1  1  1  1 |   1
cU2DB   3  3  2  2  2  2  2  1  1  1  1  1  1  1  1  1  1  1  1 |   1
cC      3  3  2  2  2  2  1  1  1  1  1  1  1  1  1  1  1  1  1 |   1
cC2R    2  2  2  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1 |   1
cRet2   7  6  5  5  4  4  4  3  3  3  2  2  2  2  2  2  1  1  1 |   1
cIALU   4  3  3  2  1  1  2  2  1  1  1  1  1  0  0  0  0  0  0 |   0

Table 7.3: Comparison of selected variables of the DLXπ+ and DLXK

The last column (separated by |) gives the values for the DLXK; all other columns refer to the DLXπ+.

τ          10      11      12      15      18      25      37 |    106
IF/BCU   344895  342703  339568  337376  335337  334241  291155 | 189741
Decode    22637   21125   17764    9001    7489    5977    4465 |   3970
Dispatch 216404  314301  272798  193953  169423   74372   74372 |  43679
Complete  11776   12684    7268    7268    7268    7268    7268 |    196
Ret./ROB  49464   43650   35197   28106   26648   16822   15364 |  19807
RF        23433   20055   18103   14722   14184   11856   11670 |  19545
PT        29690   25117   19802   12320   12320    7083    4966 |  15574
Mem     1202498  973211  904068  601757  507442  329466  277848 | 225336
IALU       4502    4038    4038    3110    3110    3110    2646 |   3693
IMul      16813   15445   14533   14077   12709   12709   12253 |   n/a2
FPAdd     43447   38967   36279   32695   30903   27319   25527 |  23735
FPMul     75333   69957   65477   59205   56517   52933   50245 |  47557
FPMisc    19510   18614   17718   16822   16822   15926   15926 |  18135
Total   2060402 1899867 1752613 1330412 1200172  899082  793705 | 610968

Table 7.4: Comparison of the costs of the DLXπ+ and the DLXK

The critical path of the DLXK starts at the instruction register, goes through the de- code phase and finally updates the PC register which determines the address of the next instruction. The last part of this path has been removed for the DLXπ+ by using branch prediction. Then by splitting the decode phase into two sub-phases and improving the circuits of the decode phase, the cycle time could be dramatically improved without increasing the number of stages significantly.

Table 7.4 lists the cost of the DLXπ+ and the DLXK for some selected cycle times

2 The DLXK does not support integer multiplications and divisions.

τ          10      11      12      15      18      25      37
IF/BCU   118.5%  117.7%  116.6%  115.9%  115.2%  114.8%  100.0%
Decode   507.0%  473.1%  397.8%  201.6%  167.7%  133.9%  100.0%
Dispatch 291.0%  422.6%  366.8%  260.8%  227.8%  100.0%  100.0%
Complete 162.0%  174.5%  100.0%  100.0%  100.0%  100.0%  100.0%
Ret./ROB 321.9%  284.1%  229.1%  182.9%  173.4%  109.5%  100.0%
RF       200.8%  171.9%  155.1%  126.2%  121.5%  101.6%  100.0%
PT       597.9%  505.8%  398.8%  248.1%  248.1%  142.6%  100.0%
Mem      432.8%  350.3%  325.4%  216.6%  182.6%  118.6%  100.0%
IALU     170.1%  152.6%  152.6%  117.5%  117.5%  117.5%  100.0%
IMul     137.2%  126.1%  118.6%  114.9%  103.7%  103.7%  100.0%
FPAdd    170.2%  152.7%  142.1%  128.1%  121.1%  107.0%  100.0%
FPMul    149.9%  139.2%  130.3%  117.8%  112.5%  105.3%  100.0%
FPMisc   122.5%  116.9%  111.3%  105.6%  105.6%  100.0%  100.0%
Total    259.6%  239.4%  220.8%  167.6%  151.2%  113.3%  100.0%

Table 7.5: Relative increase of the cost of the DLXπ+

from table 7.3. The cost of the cache of the DLXK was evenly distributed between the instruction fetch and the memory unit. The higher cost of the DLXπ+ with cycle time 37 compared to the DLXK is mainly due to the branch prediction and the non-blocking cache, which increase the cost of the instruction fetch and the memory unit, respectively.

Table 7.5 lists the relative increase of the cost for the DLXπ+ if the cycle time is reduced (normalized to the DLXπ+ with a cycle time of 37). The increase of the cost is only due to the additional forwarding circuits and registers; the number of gates for the actual computations is the same. The number of registers that have to be added due to pipelining is inversely proportional to the stage depth. Note that forwarding circuits and queues use two-dimensional pipelining, i.e., the number of registers increases quadratically if the stage depth is reduced. Queues are used in the memory unit and the dispatch phase. Extensive forwarding is used in the ROB and the producer table environment. Therefore, these parts have a large relative increase. The cost of the dispatch gets smaller again for τ ∈ {10, 11} since the number of reservation station entries has to be reduced for these cycle times. The cost of the decode phase increases drastically if the issue circuit has to be pipelined. The cost of the instruction fetch is not sensitive to the decrease of the cycle time, as the multiplier m for the instruction fetch clock IFCLK increases if the cycle time decreases. Since the total cost of the instruction fetch is high, this reduces the relative increase of the total cost of the DLXπ+.

7.4 Related Work

Super-pipelining is a well-known technique to improve processor performance. It is used in many commercial processors, e.g., in the succeeding generation of the MIPS R3000 processor on which the DLXπ+ ISA is based [BLM91]. Other work combines super-pipelining with other techniques such as super-scalar designs [JCL94], multi-threading [GV95], or both [GV96]. However, these studies use rather moderate super-pipelining, which at least allows the computation of a 16-bit addition (which takes 12 combinational gate delays in our model) in a single cycle [GV95]. Note that in the deeply pipelined Pentium 4 a 16-bit addition can be computed even at double the core frequency [HSU+01]. Extreme super-pipelining as done in this thesis is considered in previous work that studies the optimum pipeline depth of a processor [KS86][HJF+02][SC02][HP02]. Usually these studies are based on an existing processor. They compute the combinational work to be done per instruction from the number of cycles needed and the stage depth of the processor. The number of cycles needed for other frequencies is then simply computed by dividing the combinational work by the corresponding stage depth. They do not take into account that the combinational work may depend on the stage depth, e.g., if forwarding of RAM ports becomes necessary (see section 2.6.1). The optimal stage depth is obtained by simulating a benchmark suite for different processor frequencies. None of the studies details the effects on the logic if the cycle time is reduced or assumes a lower bound for the stage depth.

We now discuss the cited work in some more detail. An initial study on the optimum pipeline depth of a processor was done by Kunkel and Smith [KS86]. Their work is based on a CRAY-1S supercomputer built from discrete ECL gates. Kunkel and Smith assume that interlocking (i.e., stalling) must be computed in one cycle and that central RAM blocks must be accessed in one cycle.
These bounds can be overcome using the techniques presented in this thesis. Limitations to the minimum stage depth are noted in [KS86] but are later ignored in the simulations. They conclude that the maximum performance is achieved with a stage depth of 8 ECL gate levels for scalar code (4 levels for vector code).

Hrishikesh et al. [HJF+02] studied the optimum stage depth of an Alpha 21264 with large register files and a large cache. They assume that the instruction wakeup and instruction select logic are critical circuits that bound the stage depth. In order to allow lower stage depths, Hrishikesh et al. propose to pipeline these circuits by dividing the issue window into multiple stages. They do not discuss the consequences of this pipelining for the correctness of the logic, e.g., if an instruction moves from one stage to the next. Using these pipelined circuits and assuming that all other circuits can be perfectly pipelined into arbitrary stages, Hrishikesh et al. conclude the optimum stage depth to be 6 FO4³ for integer benchmarks and 4 FO4 for floating point benchmarks.

Sprangle and Carmean [SC02] measured the performance loss s for some critical loops (ALU, branch prediction, cache accesses) if the loops are increased by one cycle. They model the performance loss for a loop with m additional cycles to be (1 − s)^m. The total performance is computed as the product of the losses for the loops. This model matches their simulations. They derive that the performance of a Pentium 4 processor can be improved by 35 to 90% through implementing deeper pipelines and larger caches. Sprangle and Carmean assume that all circuits can be pipelined arbitrarily. They address the problem of forwarding in order to pipeline RAM accesses but do not handle pipelining of the forwarding itself and do not include forwarding in their simulations.

Hartstein and Puzak [HP02] developed an analytical formula for the performance of a S/390 processor as a function of the stage depth.
The optimum stage depth is found by setting the derivative of this formula to zero. Yet the formula uses some processor

³Fan-out-of-four (FO4) is defined as the delay of an inverter that drives four inverters of the same size.

and benchmark dependent constants which are hard to determine. Also, the formula assumes that processor logic can be uniformly divided into an arbitrary number of stages.

Recent studies [HP03] [SBG+02] also take power consumption into account. Since low stage depths dramatically increase the power consumption due to the large number of necessary latches, this generally leads to a larger optimum pipeline depth. Since the power consumption depends to first order on the cost of the processor, this can be included in our model by not just optimizing the performance of the processor but a quality metric

Quality = Performance^(1−q) / Cost^q

for a q ∈ [0; 1] as proposed by Grün [Gru94]. To the author's best knowledge the theoretical limits of super-pipelining and the minimum stage depth as presented in this thesis have not yet been studied. Also the consequences for the logic if even the forwarding circuits are pipelined have not yet been discussed.

Chapter 8

Summary

In this thesis the basic techniques needed for super-pipelining of processors were described. These techniques comprise the insertion of buffer circuits to split the stall trees and the pipelining of RAM blocks. In order to pipeline RAM blocks it is necessary to forward the data written by succeeding writes to ongoing read accesses. The forwarding circuits presented in this thesis even allow reducing the cycle time below the time needed for forwarding data by pipelining the forwarding circuit itself, which introduces two-dimensional pipelining. Using these techniques the DLXπ+, an out-of-order processor with multiple variable parameters including the cycle time, was presented. The cycle time of the DLXπ+ can be reduced down to five gate delays, with restrictions even down to four gate delays. Formulas were developed that compute the cost and the number of pipeline stages based on the cycle time. Correctness proofs were given for the parts that differ significantly from the DLX presented in [Krö99]. In particular, for every RAM block it was proven that the forwarding circuits deliver the correct data needed for the overall correctness of the Tomasulo algorithm.

8.1 Future Work

Using the formulas presented in this thesis it is possible to compute the variables describing the pipelining of the DLXπ+ as a function of the parameters. The next step would be to write a simulator for the DLXπ+. This simulator could use instruction traces of a benchmark suite in order to compute the average time per instruction (TPI) of the DLXπ+ configured to different cycle times, ROB sizes, etc. for this benchmark suite. This would allow examining the parameters of the DLXπ+ that deliver the best TPI values. In particular, the optimal cycle time for the DLXπ+ could be determined. Apart from super-pipelining, the number of active instructions in a processor can be increased by issuing multiple instructions in one cycle (super-scalar processors). The DLXπ+ only supports issuing of one instruction per cycle. A super-scalar DLX is presented in [Hil00]; methods to reduce the cycle time of the central circuits of super-scalar processors can be found, e.g., in [PJS97]. An expansion of the DLXπ+ could combine multi-issuing and super-pipelining to investigate the advantages of the different approaches. Modern processors are able to execute multiple threads at the same time. This allows for better utilization of the resources. Using multi-threading in a super-pipelined processor could decrease the optimum cycle time since useful work can be done while an instruction waits for data. The DLXπ+ uses a rather simple branch prediction scheme. The performance impact of the branch prediction increases if the cycle time is reduced, since this increases the number of cycles needed for the roll-back of a mispredicted branch. Hence, increasing the hit-rate of the branch prediction would be worthwhile. Also the roll-back could be improved, such that the mispredicted instructions do not need to retire before the succeeding instructions from the corrected branch target can be decoded.
The cost and delay calculations of the DLXπ+ are based on a rather simple gate model that does not take fanout and wiring into account. Since the stall signals of the DLXπ+ have a large fanout, a gate model that includes fanout, e.g., the logical effort model [SSH99], could have significant impact on the bounds of the cycle time. In order to investigate the impact, the delay calculations of the DLXπ+ could be adapted to the logical effort model.

Appendix A

Instruction set architecture

A.1 Instructions

In this section the instructions which are supported by the processor are summarized. All instructions which do not handle special registers or the program counter are directly taken from the MIPS 32000 instruction set [KH92]. Since the processor does not support delayed branches, the control flow change instructions are altered accordingly. Special register and interrupt handling are based on the DLX implementation by Müller and Paul [MP00]. The instructions of the processor are presented in the tables A.1 to A.7, ordered by the type of functional unit they use. If the bits [31 : 26] of the instruction indicate a floating point instruction, but the rest of the instruction matches none of the following opcodes, the uFOP (unimplemented floating point operation) interrupt is raised. Otherwise, if the instruction matches none of the following opcodes, the ill (illegal instruction) interrupt is raised. In this case the instruction is not sent to a functional unit but the valid bit of the ROB entry is set. The instructions have up to four parameters: the destination register address D, the two operand addresses OP1 and OP2, and an immediate constant imm. For the description of the instructions, the following abbreviations are used for the register files:

GD := GPR[D]          FD := FPR[D]          SD := SPR[D]
GOP1 := GPR[OP1]      FOP1 := FPR[OP1]      SOP1 := SPR[OP1]
GOP2 := GPR[OP2]      FOP2 := FPR[OP2]      SOP2 := SPR[OP2]
Gi := GPR[i]          Si := SPR[i]

The following abbreviation is used to describe a memory location of variable width (if the memory is seen as a one-dimensional array):

M(addr, d) := M[8 · (addr + d) − 1 : 8 · addr]

The memory is usually addressed by the sum of operand 1 and the immediate constant. This sum is abbreviated by ea. For the description of the instructions lwl, lwr, swl, and swr (load / store word left / right), the base address ba and the offset

IR[31:26]  Instr.  Group    Effect
100000     lb      Load     [GD] = [M(ea, 1)]
100001     lh      Load     [GD] = [M(ea, 2)]
100011     lw      Load     [GD] = [M(ea, 4)]
100100     lbu     Load     ⟨GD⟩ = ⟨M(ea, 1)⟩
100101     lhu     Load     ⟨GD⟩ = ⟨M(ea, 2)⟩
100010     lwl     LoadLR   GD = M(ba, oa + 1), GD[23 − 8 · oa : 0]
100110     lwr     LoadLR   GD = GD[31 : 8 · (4 − oa)], M(ea, 4 − oa)
110001     lwc1    LoadFP   [FD] = [M(ea, 4)]
101000     sb      Store    M(ea, 1) = GOP2[7 : 0]
101001     sh      Store    M(ea, 2) = GOP2[15 : 0]
101011     sw      Store    M(ea, 4) = GOP2
101010     swl     StoreLR  M(ba, oa + 1) = GOP2[31 : 8 · (3 − oa)]
101110     swr     StoreLR  M(ea, 4 − oa) = GOP2[8 · (4 − oa) − 1 : 0]
111001     swc1    StoreFP  M(ea, 4) = FOP2

Table A.1: Memory instructions

address oa are used:

ea := [GOP1]+[imm] ba := ⌊ea/4⌋ oa := ea mod 4
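For illustration, the three address values can be computed as follows (a hypothetical helper; addresses are taken as plain non-negative integers, two's-complement details omitted):

```python
def mem_address(gop1, imm):
    """Effective address ea = [GOP1] + [imm], base address ba = floor(ea/4),
    and offset oa = ea mod 4, as used by lwl, lwr, swl, and swr."""
    ea = gop1 + imm
    return ea, ea // 4, ea % 4

# An unaligned word at 0x1000 + 6 spans the words at ba and ba + 1:
print(mem_address(0x1000, 6))  # (4102, 1025, 2)
```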

A.2 Encoding

Table A.8 shows the encoding of the instruction parameters D, OP1, OP2, and imm. The parameters D, OP1, and OP2 always have a width of 5 bits. The width of the immediate constant imm depends on the width of the imm field in the encoding.

IR[31:26]  Instr.  Group  Effect
001000     addi    AluI   [GD] = [GOP1] + [imm]
001001     addiu   AluI   ⟨GD⟩ = ⟨GOP1⟩ + ⟨imm⟩
001010     slti    AluI   GD = ([GOP1] < [imm]) ? 0^31 1 : 0^32
001011     sltiu   AluI   GD = (⟨GOP1⟩ < ⟨imm⟩) ? 0^31 1 : 0^32
001100     andi    AluI   GD = GOP1 ∧ (0^16, imm)
001101     ori     AluI   GD = GOP1 ∨ (0^16, imm)
001110     xori    AluI   GD = GOP1 ⊕ (0^16, imm)
001111     lui     AluI   GD = (imm, 0^16)
111110     trap    Trap   trap = 1, eData = [imm]

IR[31 : 26] = 000000

IR[5:0]  Instr.  Group   Effect
000000   sll     ShiftI  GD = GOP2 << ⟨imm⟩
000010   srl     ShiftI  GD = GOP2 >> ⟨imm⟩
000011   sra     ShiftI  GD = GOP2 >> ⟨imm⟩ (arith.)
000100   sllv    Shift   GD = GOP2 << ⟨GOP1[4 : 0]⟩
000110   srlv    Shift   GD = GOP2 >> ⟨GOP1[4 : 0]⟩
000111   srav    Shift   GD = GOP2 >> ⟨GOP1[4 : 0]⟩ (arith.)
100000   add     Alu     [GD] = [GOP1] + [GOP2]
100001   addu    Alu     ⟨GD⟩ = ⟨GOP1⟩ + ⟨GOP2⟩
100010   sub     Alu     [GD] = [GOP1] − [GOP2]
100011   subu    Alu     ⟨GD⟩ = ⟨GOP1⟩ − ⟨GOP2⟩
100100   and     Alu     GD = GOP1 ∧ GOP2
100101   or      Alu     GD = GOP1 ∨ GOP2
100110   xor     Alu     GD = GOP1 ⊕ GOP2
100111   nor     Alu     GD = ¬(GOP1 ∨ GOP2)
101010   slt     Alu     GD = ([GOP1] < [GOP2]) ? 0^31 1 : 0^32
101011   sltu    Alu     GD = (⟨GOP1⟩ < ⟨GOP2⟩) ? 0^31 1 : 0^32

IR[31 : 26] = 010001

IR[25:21]  Instr.    Group   Effect
00000      movef2i   Move2I  GD = FOP2
00010      moves2i   Move2I  GD = SOP2
00100      movei2f   MoveI2  FD = GOP2
00110      movei2s   MoveI2  SD = GOP2

Table A.2: Integer ALU instructions

IR[31 : 26] = 000000

IR[5:0]  Instr.  Group  Effect
011000   mult    Mult   [S9, S8] = [GOP1] ∗ [GOP2]
011001   multu   Mult   ⟨S9, S8⟩ = ⟨GOP1⟩ ∗ ⟨GOP2⟩
011010   div     Div    [S8] = [GOP1] / [GOP2], [S9] = [GOP1] mod [GOP2]
011011   divu    Div    ⟨S8⟩ = ⟨GOP1⟩ / ⟨GOP2⟩, ⟨S9⟩ = ⟨GOP1⟩ mod ⟨GOP2⟩

Table A.3: Integer multiplicative instructions

IR[31 : 26] = 010001

IR[21]  IR[5:0]  Instr.  Group  Effect
0       000000   fadd.s  FAdd   [[FD]] = [[FOP1]] + [[FOP2]]
0       000001   fsub.s  FAdd   [[FD]] = [[FOP1]] − [[FOP2]]
1       000000   fadd.d  FAdd   [[FD+]] = [[FOP1+]] + [[FOP2+]]
1       000001   fsub.d  FAdd   [[FD+]] = [[FOP1+]] − [[FOP2+]]

Table A.4: Floating point additive instructions

IR[31 : 26] = 010001

IR[21]  IR[5:0]  Instr.  Group  Effect
0       000010   fmul.s  FMul   [[FD]] = [[FOP1]] ∗ [[FOP2]]
0       000011   fdiv.s  FMul   [[FD]] = [[FOP1]] / [[FOP2]]
1       000010   fmul.d  FMul   [[FD+]] = [[FOP1+]] ∗ [[FOP2+]]
1       000011   fdiv.d  FMul   [[FD+]] = [[FOP1+]] / [[FOP2+]]

Table A.5: Floating point multiplicative instructions

IR[31 : 26] = 010001

IR[23:21]  IR[5:0]   Instr.    Group   Effect
000        11c[3:0]  fcomp.s   FComp   S7 = ([[FOP1]] op [[FOP2]]) ? 1 : 0
001        11c[3:0]  fcomp.d   FComp   S7 = ([[FOP1+]] op [[FOP2+]]) ? 1 : 0
000        000101    fabs.s    FMisc   [[FD]] = |[[FOP1]]|
000        000110    fmov.s    FMisc   [[FD]] = [[FOP1]]
000        000111    fneg.s    FMisc   [[FD]] = −[[FOP1]]
001        000101    fabs.d    FMisc   [[FD+]] = |[[FOP1+]]|
001        000110    fmov.d    FMisc   [[FD+]] = [[FOP1+]]
001        000111    fneg.d    FMisc   [[FD+]] = −[[FOP1+]]
001        100000    fcvt.s.d  FCvt.S  [[FD]] = [[FOP1+]]
100        100000    fcvt.s.w  FCvt.S  [[FD]] = [FOP1]
000        100001    fcvt.d.s  FCvt.D  [[FD+]] = [[FOP1]]
001        100001    fcvt.d.d  FCvt.D  [[FD+]] = [[FOP1+]]
100        100001    fcvt.d.w  FCvt.D  [[FD+]] = [FOP1]
000        100100    fcvt.w.s  FCvt.W  [FD] = [[FOP1]]
001        100100    fcvt.w.d  FCvt.W  [FD] = [[FOP1+]]

Table A.6: Floating point miscellaneous instructions

IR[31:26]  Instr.  Group    Effect
000100     beq     Branch   [PC]+ = ([GOP1] = [GOP2]) ? [imm] : 4
000101     bne     Branch   [PC]+ = ([GOP1] ≠ [GOP2]) ? [imm] : 4
000110     blez    BranchZ  [PC]+ = ([GOP1] ≤ 0) ? [imm] : 4
000111     bgtz    BranchZ  [PC]+ = ([GOP1] > 0) ? [imm] : 4
000010     j       Jump     [PC]+ = [imm]
000011     jal     JumpAL   [PC]+ = [imm], [G31] = [PC] + 4
111111     rfe     RFE      [PC] = [S2], [S0] = [S1]

IR[31 : 26] = 000000

IR[5:0]  Instr.  Group    Effect
001000   jr      JumpR    [PC]+ = [GOP1]
001001   jalr    JumpALR  [PC]+ = [GOP1], [GD] = [PC] + 4

IR[31 : 26] = 000001

IR[20:16]  Instr.   Group      Effect
00000      bltz     BranchZ    [PC]+ = ([GOP1] < 0) ? [imm] : 4
00001      bgez     BranchZ    [PC]+ = ([GOP1] ≥ 0) ? [imm] : 4
10000      bltzal   BranchZAL  [PC]+ = ([GOP1] < 0) ? [imm] : 4, [G31] = [PC] + 4
10001      bgezal   BranchZAL  [PC]+ = ([GOP1] ≥ 0) ? [imm] : 4, [G31] = [PC] + 4

IR[31 : 26] = 010001 ∧ IR[25 : 21] = 01000

IR[20:16]  Instr.  Group  Effect
00000      BC1F    BC1    [PC]+ = ([S7] = 0) ? [imm] : 4
00001      BC1T    BC1    [PC]+ = ([S7] = 1) ? [imm] : 4

Table A.7: Control flow change instructions

Group      [31:26]          [25:21]       [20:16]  [15:11]  [10:6]  [5:0]
Shift      000000           OP1           OP2      D                00011*, 0001*0
ShiftI     000000                         OP2      D        imm     00001*, 0000*0
Alu        000000           OP1           OP2      D                100***, 10101*
AluI       001***           OP1           D        imm
MoveF2I    010001           00000         D        OP2
MoveS2I    010001           00010         D        OP2
MoveI2F    010001           00100         OP2      D
MoveI2S    010001           00110         OP2      D
trap       111110           imm
Mult       000000           OP1           OP2                       01100*
Div        000000           OP1           OP2                       01101*
Load       100*0*, 1000*1   OP1           D        imm
LoadLR     100*10           OP1           OP2/D    imm
LoadFP     110001           OP1           D        imm
Store      1010**, 101110   OP1           OP2      imm
StoreFP    111001           OP1           OP2      imm
Branch     00010*           OP1           OP2      imm
BranchZ    00011*, 000001   OP1           0000*    imm
BranchZAL  000001           OP1           1000*    imm
Jump       000010           imm
JumpAL     000011           imm
JumpR      000000           OP1                                     001000
JumpALR    000000           OP1                    D                001001
BC1        010001           01000         0000*    imm
rfe        111111
FAddSub    010001           1000*         OP1      OP2      D       00000*
FMulDiv    010001           1000*         OP1      OP2      D       00001*
FComp      010001           1000*         OP1      OP2              11****
FMisc      010001           1000*                  OP2      D       0001*1, 00011*
FCvt.W     010001           1000*                  OP2      D       100100
FCvt.S     010001           10001, 10100           OP2      D       100000
FCvt.D     010001           1000*, 10100           OP2      D       100001

Table A.8: Instruction set architecture encoding

Appendix B

Emulation of a MIPS R3000

Apart from interrupt handling, the ISA of the DLXπ+ differs in the following points from the MIPS R3000 processor ISA [KH92]:

• The DLXπ+ does not support delayed branches. The semantics of branch instructions are altered accordingly.

• The DLXπ+ does not have a co-processor 0 (CP0). CP0 instructions cause an illegal instruction interrupt.

• All special registers (including floating point special registers and the registers HI and LO) are combined in the special register file. Instructions accessing special registers must be emulated by the instructions moves2i and movei2s, respectively.

• The floating point special register is divided into the special registers SR (floating point mask), RM (rounding mode), IEEEf (floating point flags), and FCC (floating point condition code). Instructions accessing the floating point special register are assumed to access the special register IEEEf.

The MIPS R3000 instructions which cannot be mapped directly are emulated as summarized in table B.1.

MIPS instruction                                 DLXπ+ instruction
syscall, break                                   trap
cfc1, mfhi, mflo                                 moves2i (OP1 = 6, 8, 9)
ctc1, mtlo, mthi                                 movei2s (D = 6, 8, 9)
mfc1                                             movef2i
mtc1                                             movei2f
bc0, cfc0, mfc0, mtc0, tlbr, tlbwi, tlbwr, tlbp  not implemented

Table B.1: Translation of MIPS R3000 instructions

Appendix C

Additional Circuits

In this chapter all additional circuits needed for the design of the DLXπ+ are described.

C.1 Basic Circuits

C.1.1 Design

Figure C.1 depicts the design of a half-unary find-last-one circuit. Note that, in addition to the find-last-one output flo, the circuit computes an output zero which indicates that all input signals are zero. For n = 1 the design of the circuit is simple. For n > 1 the circuit is built from two half-unary find-last-one sub-circuits with only half of the input bits. One circuit uses the upper half of the input bits, the other the lower half. The zero output is active if both zero outputs of the sub-circuits are active. If the lower part of the inputs contains a one (i.e., the zero output of the lower part is not active), the output flo of the upper part must be forced to zero.
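The recursive construction can be sketched in software. The following is an illustrative model of the circuit of figure C.1, under the interpretation (an assumption, not stated verbatim in the text) that the half-unary output satisfies flo[i] = 1 iff all input bits below position i are zero:

```python
def hflo(bits):
    """Half-unary find-last-one, mirroring the recursion of figure C.1.

    `bits` is a list with index 0 as the least significant position.
    Returns (zero, flo): zero is True iff all inputs are 0, and
    flo[i] = 1 iff every input bit below position i is 0.
    """
    n = len(bits)
    if n == 1:
        # Base case: there are no bits below position 0.
        return (bits[0] == 0), [1]
    lo, hi = bits[: n // 2], bits[n // 2 :]   # lower floor(n/2), upper ceil(n/2) bits
    zero_lo, flo_lo = hflo(lo)
    zero_hi, flo_hi = hflo(hi)
    # A one in the lower half forces the upper half's flo outputs to zero.
    if not zero_lo:
        flo_hi = [0] * len(flo_hi)
    return zero_lo and zero_hi, flo_lo + flo_hi

zero, flo = hflo([0, 0, 1, 1])
print(zero, flo)  # False [1, 1, 1, 0]
```

The one-hot find-last-one output is then simply the AND of flo with the input bits, which is how such half-unary outputs are typically consumed.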

C.1.2 Cost and Delay

This section lists the formulas for the cost and delay of the basic circuits used in this thesis.

Figure C.1: Design of a half-unary find-last-one circuit

Find-Last-One / Find-First-One

D(FLO(n)) ≤ ⌈log n⌉ · D_AND,

C(FLO(n)) ≤ { 0                                                    if n ≤ 1,
            { C(FLO(⌈n/2⌉)) + C(FLO(⌊n/2⌋)) + (⌈n/2⌉ + 1) · C_AND  if n > 1.

D(FFO(n)) ≤ D(FLO(n)), C(FFO(n)) ≤ C(FLO(n)), D(HFLO(n)) ≤ D(FLO(n)), C(HFLO(n)) ≤ C(FLO(n)),

Decoder / Encoder

D(Dec(n)) ≤ ⌈log n⌉ · D_AND,

C(Dec(n)) ≤ { 0                                              if n ≤ 1,
            { C(Dec(⌈n/2⌉)) + C(Dec(⌊n/2⌋)) + 2^n · C_AND    if n > 1.

D(HDec(n)) ≤ n · max{D_AND, D_OR},

C(HDec(n)) ≤ { 0                                              if n ≤ 1,
             { C(HDec(n − 1)) + 2^(n−1) · (C_AND + C_OR)      if n > 1,

D(Enc(n)) ≤ { 0                            if n ≤ 1,
            { D(Enc(⌈n/2⌉)) + D_MUX        if n > 1,

C(Enc(n)) ≤ { 0                                                        if n ≤ 1,
            { C(Enc(⌈n/2⌉)) + C(Enc(⌊n/2⌋)) + C_AND + ⌊n/2⌋ · C_MUX    if n > 1.
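As a quick sanity check, the decoder recurrence can be evaluated numerically. Unit gate cost and delay (C_AND = D_AND = 1) are an assumption for illustration; the thesis's gate model assigns its own constants:

```python
from math import ceil, log2

C_AND = 1  # unit gate cost, for illustration
D_AND = 1  # unit gate delay, for illustration

def c_dec(n):
    """Cost recurrence of an n-bit decoder: C(Dec(n)) = 0 for n <= 1,
    else C(Dec(ceil(n/2))) + C(Dec(floor(n/2))) + 2**n * C_AND."""
    if n <= 1:
        return 0
    return c_dec(ceil(n / 2)) + c_dec(n // 2) + 2 ** n * C_AND

def d_dec(n):
    """Delay bound: D(Dec(n)) <= ceil(log2 n) * D_AND."""
    return ceil(log2(n)) * D_AND if n > 1 else 0

# A 5-bit decoder (32 outputs): 48 unit gates, 3 gate levels.
print(c_dec(5), d_dec(5))  # 48 3
```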

Parallel Prefix

D(PP-FUNC(D(FUNC), n)) ≤ ⌈log n⌉ · D(FUNC),

C(PP-FUNC(C(FUNC), n)) ≤ { 0                                                            if n ≤ 1,
                         { C(PP-FUNC(⌈n/2⌉)) + C(PP-FUNC(⌊n/2⌋)) + (n − 1) · C(FUNC)    if n > 1,

D(PP-AND(n)) ≤ D(PP-FUNC(D_AND, n)),
C(PP-AND(n)) ≤ C(PP-FUNC(C_AND, n)),

D(PP-OR(n)) ≤ D(PP-FUNC(D_OR, n)),
C(PP-OR(n)) ≤ C(PP-FUNC(C_OR, n)).

Incrementer / Adder

D(Inc(n)) ≤ { 0                              if n ≤ 1,
            { D(PP-AND(n − 1)) + D_XOR       if n > 1,

C(Inc(n)) ≤ C(PP-AND(n − 1)) + (n − 1) · CXOR,

D(Add(n)) ≤ DXOR + D(PP-FUNC(DMUX ,n − 1)) + max{DMUX ,DXOR},

C(Add(n)) ≤ C(PP-FUNC(CMUX ,n)) + 2 · n · CXOR.
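With unit gate delays (an illustrative assumption; the thesis's gate model fixes its own constants), the incrementer and adder delay bounds above evaluate as:

```python
from math import ceil, log2

D_AND = D_XOR = D_MUX = 1  # unit gate delays, for illustration

def d_pp(d_func, n):
    """D(PP-FUNC(D(FUNC), n)) <= ceil(log2 n) * D(FUNC)."""
    return ceil(log2(n)) * d_func if n > 1 else 0

def d_inc(n):
    """D(Inc(n)) <= D(PP-AND(n-1)) + D_XOR for n > 1."""
    return d_pp(D_AND, n - 1) + D_XOR if n > 1 else 0

def d_add(n):
    """D(Add(n)) <= D_XOR + D(PP-FUNC(D_MUX, n-1)) + max(D_MUX, D_XOR)."""
    return D_XOR + d_pp(D_MUX, n - 1) + max(D_MUX, D_XOR)

# 32-bit circuits: 6 gate levels for the incrementer, 7 for the adder.
print(d_inc(32), d_add(32))  # 6 7
```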

Trees

D(FUNC-Tree(D(FUNC),n)) ≤⌈log n⌉· D(FUNC), C(FUNC-Tree(C(FUNC),n)) ≤ (n − 1) · C(FUNC),

D(AND-Tree(n)) ≤ D(FUNC-Tree(DAND,n)),

C(AND-Tree(n)) ≤ C(FUNC-Tree(C_AND, n)),

D(OR-Tree(n)) ≤ D(FUNC-Tree(DOR,n)),

C(OR-Tree(n)) ≤ C(FUNC-Tree(C_OR, n)).

Checks

D(EQ(n)) ≤ DXOR + D(AND-Tree(n)),

C(EQ(n)) ≤ n · C_XOR + C(AND-Tree(n)),

D(Zero(n)) ≤ D(OR-Tree(n)),
C(Zero(n)) ≤ C(OR-Tree(n)).

Selection Circuit

D(Sel(n)) ≤ DAND + D(OR-Tree(n)),

C(Sel(n)) ≤ n · C_AND + C(OR-Tree(n)).

C.2 Instruction Decode

The instructions are decoded by the two circuits DestCmp and Decode. The circuit DestCmp computes the bus D.⋆ which contains all signals regarding the destination address. Since these signals are needed to update the producer tables, they are considered timing critical. The circuit DestCmp is therefore delay optimized. The remaining control signals are computed by the circuit Decode.

C.2.1 Decode

This section only describes the control signals which are needed by the decode phase. The computation of the control signals used by the functional units is described together with the functional units. The main purpose of the circuit Decode is to compute the following busses: FUtmp.{IAlu, IMul, Mem, BCU, FAdd, FMul, FMisc}, OP1.{gpr, fpr, spr}, and OP2.{gpr, fpr, spr}. These busses define which FU should be used by an instruction and from which register file the operands have to be taken. At most one signal of a bus may be active. If none of the signals OP1.⋆ or OP2.⋆ is active, an immediate constant is used as operand. The computation of the control signals is based on the division of the instructions into instruction groups. The instruction group of an instruction is defined in table A.8 along with the encoding for the group. Table C.1 shows for each instruction group which control signals must be active. The circuit Decode computes for each instruction to which instruction group it belongs. The control signals are computed as the OR of the instruction groups in which the signal is active. The last column of table C.1 defines for which groups additional control signals needed in the decode phase have to be valid. If an instruction is a floating point instruction (i.e., IR[31 : 25] = 0100011) but belongs to no group, the signal FUunimp must be activated. If the instruction is not a floating point instruction and belongs to no group, the illegal instruction interrupt signal ill must be activated.

If the branch prediction assumes that an instruction is a control flow instruction (indicated by IFQ.pb), the instruction has to be sent to the branch checking unit in any case. Hence, the signal FU.BCU has to be active and the remaining signals of the bus FU.⋆ must be 0. If IFQ.pb is not active, the value of FUtmp.⋆ can be used to compute FU.⋆. If the instruction is not a control flow instruction, the signal fpb is activated to indicate a falsely predicted branch. If an instruction is a branch instruction but IFQ.pb is not active, the signal npb (not-predicted branch) is activated. The signals fpb and npb are needed by the BCU. It follows:

FU.BCU = FUtmp.BCU ∨ IFQ.pb
FU.⋆ = FUtmp.⋆ ∧ ¬IFQ.pb
fpb = IFQ.pb ∧ ¬FUtmp.BCU
npb = ¬IFQ.pb ∧ FUtmp.BCU

The operands may only be double precision values if the instruction is a floating point instruction. For floating point instructions, double precision operands are indicated by the bit IR[21]. If the first operand is an immediate constant, the constant is always determined by the bits IR[10 : 6]. The immediate constant for operand 2 is the address of the instruction if an instruction memory interrupt occurred, otherwise the

Group      FUtmp.  OP1.  OP2.
Alu        IAlu    gpr   gpr
AluI       IAlu    gpr
Shift      IAlu    gpr   gpr
ShiftI     IAlu          gpr
MoveF2I    IAlu          fpr
MoveI2F    IAlu          gpr
MoveS2I    IAlu          spr
MoveI2S    IAlu          gpr
Trap       IAlu                 trap
Mult       IMul    gpr   gpr
Div        IMul    gpr   gpr
Load       Mem     gpr
LoadLR     Mem     gpr   gpr
LoadFP     Mem     gpr
Store      Mem     gpr   gpr
StoreFP    Mem     gpr   fpr    storeFP
Branch     BCU     gpr   gpr
BranchZ    BCU     gpr
BranchZAL  BCU     gpr
Jump       BCU
JumpAL     BCU
JumpR      BCU     gpr
JumpALR    BCU     gpr
BC1        BCU     spr
RFE        BCU     spr   spr    rfe
FAdd       FAdd    fpr   fpr
FMul       FMul    fpr   fpr
FComp      FMisc   fpr   fpr
FMisc      FMisc   fpr
FCvt.W     FMisc   fpr
FCvt.S     FMisc   fpr
FCvt.D     FMisc   fpr

Table C.1: Active control signals

sign extended bits IR[15 : 0].

OP1,2.dbl = IR[21],

OP1.imm = IR[10 : 6],

OP2.imm = { fPC                      if Ipf ∨ Imal,
          { (IR[15])^16 IR[15 : 0]   otherwise.
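The case distinction for OP2.imm can be sketched as follows (the function and argument names are illustrative, not from the text):

```python
def op2_imm(ir, imem_interrupt, fpc=0):
    """Second-operand immediate: the faulting PC if an instruction
    memory interrupt (Ipf or Imal) occurred, otherwise IR[15:0]
    sign-extended to 32 bits by replicating IR[15] sixteen times."""
    if imem_interrupt:
        return fpc
    imm = ir & 0xFFFF
    if imm & 0x8000:        # IR[15] set: fill the upper half with ones
        imm |= 0xFFFF0000
    return imm

print(hex(op2_imm(0x0000FFFE, False)))  # 0xfffffffe
print(hex(op2_imm(0x00001234, False)))  # 0x1234
```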

The only instructions that explicitly read or write the special register IEEEf are moves2i and movei2s, respectively. These instructions access the register IEEEf if the

Variable  Meaning                                                     Value
λmax      length of the longest monomial in table A.8                 17
λsum      accumulated length of all monomials in table A.8            395
νmax      maximum number of monomials per group in table A.8          2
νsum      number of monomials in table A.8                            41
γ         number of groups in table C.1                               32
βmax      maximum frequency of a control signal in table C.1          15
βsum      accumulated frequency of all control signals in table C.1   70
ω         number of output signals in table C.1                       16

Table C.2: Variables of the Decode Computation

bits IR[15 : 11] encode the number of this register (6):

readIEEEf = MoveS2I ∧ (IR[15 : 11] = 00110)
writeIEEEf = MoveI2S ∧ (IR[15 : 11] = 00110)

Cost and delay of the circuit Decode can be computed using the variables from table C.2 derived from the tables A.8 and C.1:

C(Decode) ≤ (λsum − νsum) · CAND + (νsum − γ + βsum − ω) · COR

+ (12 + 8 + 7) · CAND + COR + 32 · CMUX

≤ 381 · CAND + 64 · COR + 32 · CMUX

D(Decode) ≤⌈log λmax⌉· DAND + ⌈log(νmax · βmax)⌉· DOR + DAND

≤ 6 · DAND + 5 · DOR
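Substituting the values from table C.2 reproduces the simplified bounds above. Unit gate costs and delays are an assumption for illustration; the thesis's gate model uses its own constants:

```python
from math import ceil, log2

# Variables from table C.2 (derived from tables A.8 and C.1).
lam_max, lam_sum = 17, 395
nu_max, nu_sum = 2, 41
gamma, beta_max, beta_sum, omega = 32, 15, 70, 16

C_AND = C_OR = C_MUX = 1  # unit gate costs, for illustration
D_AND = D_OR = 1          # unit gate delays, for illustration

c_decode = ((lam_sum - nu_sum) * C_AND
            + (nu_sum - gamma + beta_sum - omega) * C_OR
            + (12 + 8 + 7) * C_AND + C_OR + 32 * C_MUX)
d_decode = (ceil(log2(lam_max)) * D_AND
            + ceil(log2(nu_max * beta_max)) * D_OR
            + D_AND)

# With unit constants this equals 381*C_AND + 64*C_OR + 32*C_MUX = 477
# and 6*D_AND + 5*D_OR = 11, matching the simplified bounds.
print(c_decode, d_decode)  # 477 11
```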

C.2.2 Destination computation

The computation of the signals D.{gpr, fpr, spr}, D.addr, and D.dbl can be derived from table C.3. To avoid unnecessarily many stages of forwarding in the decode sub-phase D1, the delay of the circuit DestCmp should be minimized. The computation of the signals can be simplified using the fact that the illegal instruction signal ill causes an abort interrupt. Thus the value of the write signals may be arbitrary for illegal instructions. For the implementation D.R.impl.write of a write signal D.R.write of a register file R it suffices if the following equation holds:

D.R.write ≤ D.R.impl.write ≤ D.R.write ∨ ill

The other signals D.R.⋆ of the register file R (excluding the write signal) only need to have correct values if the write signal is active. This simplifies the condition for the implementation D.R.impl.⋆ of these signals as follows:

D.R.impl.⋆ = { D.R.⋆      if D.R.write = 1,
             { arbitrary  else.

For the sake of simplicity, the implementation of the above signals and their definition will be identified. Figures C.2 to C.4 show the parts of the circuit DestCmp for

Group                                  D.   D.addr      D.dbl
MoveI2F, LoadFP                        fpr  IR[20 : 16]  0
FCvt.W, FCvt.S                         fpr  IR[10 : 6]   0
FAddSub, FMulDiv, FMisc                fpr  IR[10 : 6]   IR[21]
FCvt.D                                 fpr  IR[10 : 6]   1
AluI, MoveF2I, MoveS2I, Load, LoadLR   gpr  IR[20 : 16]  0
Alu, Shift, ShiftI, JumpALR            gpr  IR[15 : 11]  0
BranchZAL, JumpAL                      gpr  11111        0
Mult, Div                              spr  01000        1
MoveI2S                                spr  IR[15 : 11]  0
rfe                                    spr  00000        0
FComp                                  spr  00111        0

Table C.3: Destination registers

GPR, FPR, and SPR. Cost and delay of the circuit are:

D(DestCmp) ≤ max{2 · DOR, 2 · DAND,DMUX } + 2 · DOR

C(DestCmp) ≤ 58 · CAND + 16 · COR + 27 · CMUX 198 Additional Circuits

Figure C.2: Destination computation for GPR

Figure C.3: Destination computation for FPR

Figure C.4: Destination computation for SPR

Appendix D

Functional Units

The pipelining of the functional units is straightforward. The functional units used by the DLXπ+ are only listed for completeness and to compute the costs and the number of cycles of the computations. The design of the integer multiplicative unit and the floating point units is not detailed here as it lies beyond the scope of this thesis.

D.1 Integer ALU

Figure D.1 depicts an overview of the integer ALU. It consists of an arithmetic unit AU, a shift unit SU, and a logic unit LU. The AU is detailed in figure D.2. It performs additions, subtractions, and test operations. The AU consists of a 32 bit adder and some glue logic to compute subtractions and the correct negative and overflow signals. Cost and delay of the arithmetic unit are:

D(AU) ≤ D(Add(32)) + 2 · DXOR,

C(AU) ≤ C(Add(32)) + 36 · CXOR + 2 · CAND.

The SU (see figure D.3) uses a cyclic shifter to compute right and left shifts. The shift amount for right shifts is computed by an incrementer. A mask replaces the most significant bits by zeros, or by the sign bit for arithmetic shifts. Cost and delay of the shift

Figure D.1: Integer ALU

Figure D.2: Arithmetic unit

Figure D.3: Shift unit

unit are computed as:

D(SU) ≤ 7 · DMUX ,

C(SU) ≤ C(Inc(5)) + 229 · C_MUX + C_AND.

The LU (see figure D.4) computes logic operations on the operands depending on the opcode. It also copies the second operand to the output for trap instructions. Cost and delay of the logic unit are:

D(LU) ≤ DXOR + D(Sel(5)),

C(LU) ≤ 64 · CAND + 32 · COR + 32 · CXOR + 32 · CNOR + 32 · C(Sel(5)).

Let e_RS2 be the number of entries of the integer ALU reservation station and let D(FU(e_RS2)) be the additional delay to the integer ALU from the reservation station. Let n be the number of functional units and let t_L be computed as in section 4.4. Then the delay of the integer ALU and the delay of the stall input from the completion phase

Figure D.4: Logic unit

are:

D(IALU) ≤ D(FU(e_RS2)) + max{D(AU), D(SU), D(LU)} + 2 · D_MUX,

D(IALU.stallIn) ≤ { D_AND + D(FLO(min{t_L, n})) + D_MUX   if t_L > 2,
                  { D_AND + D_OR                          if t_L = 2.

Let the variable bIALU be one if a buffer circuit needs to be added to the integer ALU. This is the case if the following equation does not hold (similar to the BCU, see section 6.5.2):

δ ≥ max{D(AND-Tree(⌈D(ALU)/δ⌉ + 1)), D(IAlu.stallIn)} + D_AND
    + { 2 · D_AND           if e_RS1 = 1,
      { 2 · D_AND + D_MUX   if e_RS1 > 1.

Hence, the number of cycles c_IALU and the cost of the integer ALU can be approximated by:

cIALU = ⌈(D(ALU)+ bIALU · DMUX )/δ⌉,

C(IALU) ≤ C(AU) + C(SU) + C(LU) + 96 · C_MUX + C_AND + C_OR
          + (c_IALU − 1) · ⌈(65 + 8 + 34 + l_ROB)/2⌉ · C_REG.
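As an illustration of the cycle-count formula, with D_MUX = 1 and hypothetical ALU delays (both are assumptions, not values from the thesis):

```python
from math import ceil

D_MUX = 1  # illustrative unit delay

def c_ialu_cycles(d_alu, b_ialu, delta):
    """c_IALU = ceil((D(ALU) + b_IALU * D_MUX) / delta): the number of
    stages the integer ALU occupies at stage depth delta, where b_IALU
    is one if a buffer circuit has been inserted."""
    return ceil((d_alu + b_ialu * D_MUX) / delta)

# An 18-gate-delay ALU fits into 4 stages of depth 5; with a buffer
# mux, a 20-gate-delay ALU needs 5 stages.
print(c_ialu_cycles(18, 0, 5), c_ialu_cycles(20, 1, 5))  # 4 5
```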

D.2 Integer Multiplicative Unit

Figure D.5 depicts the integer multiplicative unit. The first part of the unit distributes the instructions to either the multiplication or the division part of the unit. It also computes the Booth recoding for the multiplier. The second part of the integer multiplicative unit consists of a Wallace tree for the multiplication and an SRT based divide circuit (see, e.g., [HOH97]). In the last part the carry-save or carry-borrow result of multiplications or divisions, respectively, is compressed using a 64 bit adder. The design of the divider circuit is not detailed here. It is assumed that it has the same delay as the Wallace tree and computes 4 result digits. Thus, it has to be used 8 times to get the 32 bit quotient and remainder. The delay of the distribution part is given by the cost and the delay of the Booth recoding (see [MP00] for the formulas) and the additional delay introduced by the

OP1 OP2

Distribute

Divide Wallace Tree

Compress

to Completion Figure D.5: Integer multiplicative unit

integer multiplicative reservation station. Let eRS3 be the number of entries of the reservation station and D(FU(eRS3 )) be the additional cost. Then:

C(Distribute) ≤ 16 · (2 · CXOR + CNOR) + 33 · (3 · CNAND + CXOR),

D(Distribute) ≤ D(FU(eRS3 )) + DXOR + DNOR + 2 · DAND + DXOR.

The delay and cost of a Wallace Tree with 32 bits can also be derived from the formulas in [MP00]:

C(WallaceTree) ≤ (35 · 14 + 2 · 2 · 16) · C(Add(1)),

D(WallaceTree) ≤ 2 · 3 · D(Add(1)).

The cost of the divider is approximated by the cost of four 32 bit carry-save adders and a 64 bit shifter.

C(Divide) ≤ 4 · 32 · C(Add(1)) + 64 · (CMUX + CREG).
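The claim that the divider must be used 8 times follows directly from the assumed 4 result digits per pass; as a sketch:

```python
import math

QUOTIENT_BITS = 32   # width of quotient and remainder
DIGITS_PER_PASS = 4  # result digits the divider retires per use (assumed above)

# Number of times the divide circuit has to be used:
passes = math.ceil(QUOTIENT_BITS / DIGITS_PER_PASS)
assert passes == 8
```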

The circuit Compress in the last part of the unit also selects between the outputs of the Wallace tree and the divider circuit. Therefore, cost and delay of this circuit are approximated by:

D(Compress) ≤ DMUX + D(Add(64)),

C(Compress) ≤ 64 · CMUX + C(Add(64)).

Buffer circuits are added to the first stages of the circuits Compress and Distribute if necessary. Let the variable bIMul3 be one if a buffer circuit is inserted into the circuit Compress, and bIMul1 be one if a buffer circuit is inserted into the circuit Distribute. Let IMul.stallIn be the stall input of the integer multiplicative unit from the completion phase. Then the delay of the stall output to the reservation station (design not detailed here) is assumed to be:

D(IMul.stallOut) ≤ max{max{D(AND-Tree(⌈D(Compress)/δ⌉)), D(IMul.stallIn)},
                       D(AND-Tree(⌈D(WallaceTree)/δ⌉))}
                 + DAND + DOR + DMUX + DAND.

If the inequality

δ ≥ D(IMul.stallOut) + 2 · DAND           if eRS1 = 1,
δ ≥ D(IMul.stallOut) + 2 · DAND + DMUX    if eRS1 > 1

does not hold, bIMul3 is set to one. This reduces the requirement to:

δ ≥ D(AND-Tree(⌈D(WallaceTree)/δ⌉ + 2)) + DOR + 2 · DAND           if eRS1 = 1,
δ ≥ D(AND-Tree(⌈D(WallaceTree)/δ⌉ + 2)) + DOR + 2 · DAND + DMUX    if eRS1 > 1.

If this equation does not hold, bIMul1 is also set to one. Then all requirements are assumed to hold for δ ≥ 5. Let cIMul1, cIMul2, and cIMul3 be the number of cycles needed for the circuits Distribute, WallaceTree, and Compress. It holds:

cIMul1 = ⌈(D(Distribute) + bIMul1 · DMUX)/δ⌉,

cIMul2 = ⌈D(WallaceTree)/δ⌉,

cIMul3 = ⌈(D(Compress) + bIMul3 · DMUX)/δ⌉,

C(IMul) ≤ C(Distribute) + C(Divide) + C(WallaceTree) + C(Compress)
        + (cIMul1 + cIMul2 + cIMul3) · ⌈(65 + 8 + 34)/2⌉ · CREG.
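For illustration, the pipeline register term of C(IMul) can be evaluated for given stage counts. The constant CREG below is an illustrative assumption, not the register cost of the thesis' model:

```python
import math

C_REG = 8  # illustrative cost of a 1-bit register (assumption)

def imul_register_cost(c1: int, c2: int, c3: int) -> int:
    """Pipeline register term of C(IMul):
    (cIMul1 + cIMul2 + cIMul3) * ceil((65 + 8 + 34) / 2) * CREG."""
    width = math.ceil((65 + 8 + 34) / 2)  # = 54 register bits per stage
    return (c1 + c2 + c3) * width * C_REG

# Three single-cycle parts:
assert imul_register_cost(1, 1, 1) == 3 * 54 * C_REG
```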

D.3 Floating Point Units

The delays of the additive and multiplicative floating point units are taken from [Sei03]. The delay of the miscellaneous floating point unit is computed by synthesis using Synergy [Cad97] from the Verilog source [Lei02] of the corresponding unit from [Jac02]. The additive and the miscellaneous floating point units have straight pipelines. Let eRS4 and eRS6 be the number of entries of the reservation stations for the additive and miscellaneous floating point units and let D(FU(eRS4)) and D(FU(eRS6)) be the additional delay introduced by the reservation stations. The delay of the additive and miscellaneous floating point units is approximated by:

D(FPAdd) ≤ D(FU(eRS4)) + 114,

D(FPMisc) ≤ D(FU(eRS6)) + 30.

A buffer circuit is inserted into the first stage of the additive and miscellaneous floating point unit if the requirements for the stall input of the corresponding reservation stations do not hold. This is assumed to suffice for the miscellaneous floating point unit and a stage depth δ = 5. However, a second buffer circuit has to be inserted into the additive floating point unit if the following equation does not hold:

δ ≥ max{D(FPAdd.stallIn),D(AND-Tree(⌈D(FPAdd)/δ⌉ + 1))} + DAND.

Let bFPAdd and bFPMisc be the number of buffers inserted into the respective floating point units. Then the numbers of cycles cFPAdd and cFPMisc are approximated by:

cFPAdd = ⌈(D(FPAdd) + bFPAdd · DMUX)/δ⌉,

cFPMisc = ⌈(D(FPMisc) + bFPMisc · DMUX)/δ⌉.
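With the unit delays above, the stage counts follow directly. The reservation-station delay D(FU(eRS)) and DMUX below are illustrative assumptions:

```python
import math

D_MUX = 2    # illustrative mux delay (assumption)
D_FU_RS = 4  # assumed reservation-station delay D(FU(eRS))

def fp_cycles(base_delay: int, buffers: int, delta: int) -> int:
    """Stage count of a straight-pipeline FP unit, cf. cFPAdd / cFPMisc."""
    return math.ceil((D_FU_RS + base_delay + buffers * D_MUX) / delta)

# At stage depth delta = 5 and without buffer circuits:
assert fp_cycles(114, 0, 5) == 24  # additive FP unit (114 gate delays)
assert fp_cycles(30, 0, 5) == 7    # miscellaneous FP unit (30 gate delays)
```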

Figure D.6: Floating point multiplicative unit

An overview of the multiplicative floating point unit is depicted in figure D.6. The first part of the unit Unpack unpacks the operands. The approximation circuit for the reciprocal Approx is only needed by divisions. The third part of the unit is the actual multiplier circuit Multiply. While multiplications use this circuit only once, divisions need to use the circuit 5 times for single precision and 7 times for double precision. The last two multiplications of the divisions are independent and can be started in two succeeding cycles. The last part of the unit Round rounds the result. Let eRS5 be the number of entries of the multiplicative floating point reservation station and let D(FU(eRS5)) be the additional delay for the functional unit from the reservation stations. Then the delays of the circuits are as follows [Sei03]:

D(Unpack) ≤ D(FU(eRS5)) + 29,

D(Approx) ≤ 38,

D(Multiply) ≤ 67,

D(Round) ≤ 34.

The delay of the stall output (design not detailed here) if no buffer circuits are inserted is:

D(stallOut) ≤ max{D(FPMul.stallIn),D(AND-Tree(⌈34/δ⌉ + ⌈67/δ⌉ + 1))}

+ DAND + DOR + DAND + DMUX + DAND.

If this does not fulfill the requirements of the stall input of the reservation station, a buffer circuit is inserted into the first stage of the circuit Multiply. This reduces the delay of the stall output to:

D(stallOut) ≤ max{DAND + DOR, D(AND-Tree(⌈38/δ⌉))}
            + DAND + DMUX + DAND.

If this also does not fulfill the requirements, another buffer circuit is inserted into the first stage of the circuit Unpack. Let cFPMul1, cFPMul2, cFPMul3, and cFPMul4 be the number of cycles needed for the circuits Unpack, Approx, Multiply, and Round. Let the variables bFPMul1 and bFPMul3 indicate if buffer circuits are inserted into the circuits Unpack and Multiply. Then:

cFPMul1 = ⌈(D(Unpack) + bFPMul1 · DMUX)/δ⌉,

cFPMul2 = ⌈D(Approx)/δ⌉,

cFPMul3 = ⌈(D(Multiply) + bFPMul3 · DMUX)/δ⌉,

cFPMul4 = ⌈D(Round)/δ⌉.
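Plugging the circuit delays from above into these formulas gives the stage counts of the multiplicative unit; the reservation-station delay and DMUX are again illustrative assumptions:

```python
import math

D_MUX = 2    # illustrative mux delay (assumption)
D_FU_RS = 4  # assumed reservation-station delay added to Unpack

def fpmul_cycles(delta: int, b1: int = 0, b3: int = 0):
    """Stage counts (cFPMul1..cFPMul4) of Unpack, Approx, Multiply, Round."""
    c1 = math.ceil((D_FU_RS + 29 + b1 * D_MUX) / delta)  # Unpack
    c2 = math.ceil(38 / delta)                           # Approx
    c3 = math.ceil((67 + b3 * D_MUX) / delta)            # Multiply
    c4 = math.ceil(34 / delta)                           # Round
    return c1, c2, c3, c4

# At stage depth delta = 5 without buffer circuits:
assert fpmul_cycles(5) == (7, 8, 14, 7)
```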

The costs of the floating point circuits are approximated by the costs of the corresponding floating point units from [Lei99]. The number of inputs of the floating point units is 146 + lROB, the number of outputs is 70 + lROB. Then the costs of the units are approximated by:

C(FPAdd) ≤ 23735 + (cFPAdd − 1) · (108 + lROB) · CREG,

C(FPMisc) ≤ 15926 + (cFPMisc − 1) · (108 + lROB) · CREG,

C(FPMul) ≤ 47557 + (cFPMul1 + cFPMul2 + cFPMul3 + cFPMul4 − 4) · (108 + lROB) · CREG.

D.4 Memory Unit

This section describes the circuits shift for store Sh4S and shift for load Sh4L used by the memory unit.

D.4.1 Shift for Store

The shift for store circuit Sh4S mainly computes the effective address of the memory access and shifts the store data to the right position. The design of the circuit Sh4S is straightforward except for the load / store word left / right instructions. Table D.1 shows the mapping of the source bytes to the destination bytes for these instructions in dependence of the bits ea[1 : 0] of the effective address. The bold numbers for the load instructions are the bytes of the destination register (which is also operand two) that may not be changed. The data cache can replace a byte i only with the byte i of the word loaded from the memory. Thus, operand two has to be shifted before the cache access in a way that the bytes which must not be replaced are in the positions that are not written. The circuit Sh4L then shifts the bytes to their final position. Table D.2 shows the result of the data cache access if the bytes to be preserved are shifted to the positions that are not overwritten.

Table D.1: Mapping of source bytes to destination bytes

Table D.2: Result of the data cache access for load word left / right instructions

For default store instructions, the bytes to be written have to be shifted ⟨ea[1 : 0]⟩ bytes to the left. This also holds for store word right and load word right instructions. If cyclic shifters are used, it suffices to pre-shift the data by one byte for load word left and store word left instructions. Figure D.7 shows the circuit Sh4S. Apart from the effective address ea and the bus data it computes the bus ub⋆, which indicates which bytes are used, and the data misaligned interrupt Dmal. The circuit uses multiple control signals which can be derived from the opcode and are not assumed to be timing critical.

Figure D.7: Shift for Store

The signals lwl and swl indicate a load respectively a store word left instruction. If the signal slwlr is active, the instruction is a store / load word left / right instruction. Otherwise the signals hw and w indicate a half-word respectively a word wide access. For standard accesses the computation of the bus ub⋆ based on the signals ea[1 : 0], hw, and w is straightforward. For load / store word left / right instructions the bus can be computed with half-decoders. The interrupt Dmal must be active if the bit ea[0] is active and the instruction is a half-word or word-wide access, or if the bit ea[1] is active for a word-wide access. If the data misaligned interrupt Dmal is active, the instruction must not be sent to the data cache but to the circuit Sh4L. Hence:

DCache.full = full ∧ ¬Dmal,
Sh4L.full = full ∧ Dmal.

Note that only the lowest order bits of the effective address ea are used by the other circuits. The delay and cost of the circuit Sh4S are (including the computation of the control signals):

D(Sh4S) ≤ max{D(Add(32)), D(Add(2)) + D(HDec(2)) + 2 · DMUX},

C(Sh4S) ≤ C(Add(32)) + 2 · C(Dec(2)) + C(HDec(2)) + 104 · CMUX + 11 · COR + 10 · CAND.
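The computation of the byte-enable bus ub⋆ for standard accesses and of the interrupt Dmal described above can be sketched in software. The encoding (bit i of the mask set iff byte i of the memory word is used) is an assumption for illustration:

```python
def dmal(ea0: bool, ea1: bool, hw: bool, w: bool) -> bool:
    """Data misaligned interrupt: ea[0] set for a half-word or word
    access, or ea[1] set for a word access."""
    return (ea0 and (hw or w)) or (ea1 and w)

def ub_standard(ea: int, hw: bool, w: bool) -> int:
    """Byte-enable mask for aligned standard accesses; bit i is set
    iff byte i of the memory word is used."""
    if w:
        return 0b1111           # word access: all four bytes
    if hw:
        return 0b0011 << ea     # half-word: ea[0] = 0 for aligned accesses
    return 0b0001 << ea         # byte access

assert dmal(True, False, True, False)      # half-word at an odd address
assert not dmal(False, True, True, False)  # half-word at ea[1:0] = 10 is aligned
assert ub_standard(2, True, False) == 0b1100
```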

D.4.2 Shift for Load

The shift for load circuit Sh4L mainly receives the result from the data cache, shifts the bytes to be read to the right, and does a sign extension if necessary. The arbiter circuit of the memory unit which selects the source of the next data to be completed is described in section 5.7. Figure D.8 shows the rest of the shift for load circuit. The upper part of the circuit computes the result (including sign extension) for standard read accesses. If the instruction is a load word left or load word right, the word must be shifted from the position in table D.2 to the position in table D.1. This can be done using a cyclic shifter and an incrementer. If an interrupt occurred, the memory unit must return the effective address as result. Note that the control signals are not critical as they can be computed during shift for store. Then the delay and cost of the circuit Sh4L (including the computation of the control signals) are:

D(Sh4L) ≤ 4 · DMUX,

C(Sh4L) ≤ C(Inc(2)) + C(Dec(2)) + C(Sel(4)) + 186 · CMUX + 3 · COR + CAND.
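The standard-access path of Sh4L (byte selection plus sign or zero extension) behaves like the following sketch. The byte numbering by bit position is an assumption of this sketch, chosen to match the byte indices used in the tables:

```python
def sh4l_standard(data_in: int, ea: int, hw: bool, w: bool, signed: bool) -> int:
    """Select the addressed byte / half-word / word from the 32-bit cache
    word and sign- or zero-extend it (standard loads only)."""
    if w:
        val, bits = data_in & 0xFFFFFFFF, 32
    elif hw:
        val, bits = (data_in >> (8 * ea)) & 0xFFFF, 16
    else:
        val, bits = (data_in >> (8 * ea)) & 0xFF, 8
    if signed and val & (1 << (bits - 1)):
        val -= 1 << bits  # two's-complement sign extension
    return val

assert sh4l_standard(0x1234FF78, 1, False, False, True) == -1      # signed byte 0xFF
assert sh4l_standard(0x1234FF78, 2, True, False, False) == 0x1234  # unsigned half-word
```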

Figure D.8: Shift for Load

Appendix E

Cost and Delay

This appendix lists the variables of the DLXπ+ in dependence of the parameters listed in table E.1. In order to reduce the number of dimensions only three parameters are assumed to be variable: the cycle time, the ROB size, and the number of entries per reservation station. For all other parameters reasonable default values are chosen: the DLXπ+ has only one functional unit per type. The data and the instruction cache are both 4-way set associative, 16K large, and both have 32 byte wide cache-lines. The update queue of the data cache has 4 entries, the read queue has 8 entries. The BTB is 4-way set associative and has 256 entries. The cache fetches 8 instructions per IFCLK cycle. Note that for a stage depth δ of 5 it is not possible to build an update queue or read queue with more than two entries. For δ = 6 the read queue may have at most four entries. In these cases the number of entries of the queues is set to the maximum possible value.

The cycle time is between 10 and 100 gate delays (i.e., 5 ≤ δ ≤ 95). The ROB has between 32 and 128 lines. The reservation stations have between 2 and 8 entries. For simplicity all reservation stations have the same number of entries. Note that for δ ∈ {5, 6} all reservation stations except the memory and the branch checking reservation station may have at most two respectively four entries. For the memory and branch checking reservation station, the allowed number of entries is even less. Therefore, for these reservation stations the results for only one entry are also given. No results are given if the number of reservation station entries is larger than the maximum possible number.

Variable            Description                                       Default
τ := δ + 5          cycle time of the DLXπ+                           10 − 100
LROB := 2^lROB      # lines of the reorder buffer                     32 − 128
ni                  # FU's of type i (0 ≤ i ≤ 6)                      1
eRSi                # entries of the RSs of type i (0 ≤ i ≤ 6)        2 − 8
LDC := 2^lDC        # lines of the data cache                         128
SDC := 2^sDC        # bytes of the cache-lines of the data cache      32
KDC := 2^kDC        associativity of the data cache                   4
eUQ                 # entries of the update queue                     4
eRQ                 # entries of the read queue                       8
LIC := 2^lIC        # lines of the instr. cache                       128
SIC := 2^sIC        # bytes of the cache-lines of the instr. cache    32
KIC := 2^kIC        associativity of the instruction cache            4
LBTB := 2^lBTB      # lines of the branch target buffer               64
KBTB := 2^kBTB      associativity of the branch target buffer         4
FS := 2^fs          # instructions fetched per cycle                  8

Table E.1: Parameters of the DLXπ+

Tables E.2 and E.3 list the variables of the DLXπ+ in dependence of the three free parameters. If the variables do not change when one of the parameters is increased, the corresponding line respectively column is omitted. Tables E.4 and E.5 list the overall cost of the DLXπ+ in dependence of the three free parameters.
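For reference, the default configuration of table E.1 can be collected in a small parameter record (the key names are ad hoc, not from the thesis):

```python
# Default DLXpi+ parameters from table E.1; the swept parameters
# (ROB lines, RS entries) are noted with their ranges.
DEFAULTS = {
    "L_ROB": 64,   # reorder buffer lines, swept over 32-128
    "e_RS": 4,     # reservation station entries, swept over 2-8
    "n_FU": 1,     # functional units per type
    "L_DC": 128, "S_DC": 32, "K_DC": 4,  # data cache: lines / line bytes / ways
    "e_UQ": 4, "e_RQ": 8,                # update / read queue entries
    "L_IC": 128, "S_IC": 32, "K_IC": 4,  # instruction cache
    "L_BTB": 64, "K_BTB": 4,             # branch target buffer
    "FS": 8,                             # instructions fetched per IFCLK cycle
}

# Sanity check: 128 lines x 32 bytes x 4 ways = 16K data cache, as stated.
assert DEFAULTS["L_DC"] * DEFAULTS["S_DC"] * DEFAULTS["K_DC"] == 16 * 1024
```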
Table E.2: Variables of the DLXπ+ (part 1)

Table E.3: Variables of the DLXπ+ (part 2)

Table E.4: Cost of the DLXπ+ (part 1)

Table E.5: Cost of the DLXπ+ (part 2)

Bibliography

[BJK+03] Sven Beyer, Christian Jacobi, Daniel Kröning, Dirk Leinenbach, and Wolfgang J. Paul. Instantiating uninterpreted functional units and memory system: functional verification of the VAMP. In CHARME 2003, volume 2860 of LNCS, pages 51–65. Springer, 2003.

[BLM91] Asghar Bashteen, Ivy Lui, and Jill Mullan. A superpipeline approach to the MIPS architecture. In Proceedings of the IEEE COMPCON, pages 8–12, 1991.

[Cad97] Cadence Design Systems Inc. Synergy HDL Command Reference, 1997.

[Grü94] Thomas Grün. Quantitative Analyse von I/O-Architekturen. Dissertation, Universität des Saarlandes, Computer Science Department, Saarbrücken, 1994.

[GV95] Bernard Goossens and Duc Thang Vu. Further pipelining and multithreading to improve RISC processor speed. A proposed architecture and simulation results. In Proceedings of the 3rd International Conference of Technologies, pages 326–340, September 1995.

[GV96] Bernard Goossens and Duc Thang Vu. Multithreading to improve cycle width and CPI in superpipelined superscalar processors. In Proceedings of the International Symposium on Parallel Architectures, Algorithms and Networks, pages 36–42, June 1996.

[HHM99] M. Horowitz, R. Ho, and K. Mai. The future of wires. In Invited Workshop Paper for SRC Conference, May 1999.

[Hil95] Mark Hill. SPEC92 Traces for MIPS R3000 processors. University of Wisconsin, Madison, 1995. ftp://tracebase.nmsu.edu/pub/tracebase4/r3000/.

[Hil00] Mark A. Hillebrand. Design and evaluation of a superscalar RISC processor. Diplomarbeit, Universität des Saarlandes, Computer Science Department, Saarbrücken, 2000.

[HJF+02] M.S. Hrishikesh, Norman P. Jouppi, Keith I. Farkas, Doug Burger, Stephen W. Keckler, and Premkishore Shivakumar. The optimal logic depth per pipeline stage is 6 to 8 FO4 inverter delays. In Proceedings of the 29th Annual International Symposium on Computer Architecture, pages 14–24, May 2002.

[HOH97] David L. Harris, Stuart F. Oberman, and Mark H. Horowitz. SRT division architectures and implementations. In Proceedings of the 13th IEEE Symposium on Computer Arithmetic, pages 18–25, 1997.

[HP96] John L. Hennessy and David A. Patterson. Computer Architecture: A Quantitative Approach; second edition. Morgan Kaufmann, San Francisco, California, 1996.

[HP02] A. Hartstein and Thomas R. Puzak. The optimum pipeline depth for a microprocessor. In Proceedings of the 29th Annual International Symposium on Computer Architecture, pages 7–13, May 2002.

[HP03] A. Hartstein and Thomas R. Puzak. Optimum power/performance pipeline depth. In Proceedings of the 36th International Symposium on Microarchitecture, 2003.

[HSU+01] Glenn Hinton, Dave Sager, Mike Upton, Darrell Boggs, Doug Carmean, Alan Kyker, and Patrice Roussel. The microarchitecture of the Pentium 4 processor. Intel Technology Journal, First Quarter 2001.

[IEEE] Institute of Electrical and Electronics Engineers. ANSI/IEEE standard 754-1985, IEEE Standard for Binary Floating-Point Arithmetic, 1985.

[Jac02] Christian Jacobi. Formal Verification of a Fully IEEE Compliant Floating Point Unit. Dissertation, Universität des Saarlandes, Computer Science Department, Saarbrücken, 2002.

[JCL94] Stephan Jourdan, Dominique Carrière, and Daniel Litaize. A high out-of-order issue symmetric superpipeline superscalar. In Proceedings of the 20th Euromicro conference, pages 338–345, January 1994.

[KH92] Gerry Kane and Joe Heinrich. MIPS RISC Architecture. Prentice Hall, New Jersey, 1992.

[KMP99] Daniel Kröning, Silvia M. Müller, and Wolfgang Paul. A rigorous correctness proof of the Tomasulo scheduling algorithm with precise interrupts. In Proceedings of the SCI'99/ISAS'99 International Conference, 1999.

[Kog81] Peter M. Kogge. The Architecture of Pipelined Computers. McGraw-Hill, 1981.

[KP95] Jörg Keller and Wolfgang Paul. Hardware Design — Formaler Entwurf Digitaler Schaltungen. Teubner, 1995. (In German.)

[Kro81] D. Kroft. Lockup-free instruction fetch/prefetch cache organization. In Proceedings of the 8th International Symposium on Computer Architec- ture, pages 81–87, May 1981.

[Krö99] Daniel Kröning. Design and evaluation of a RISC processor with a Tomasulo scheduler. Diplomarbeit, Universität des Saarlandes, Computer Science Department, Saarbrücken, 1999.

[Krö01] Daniel Kröning. Formal Verification of Pipelined Microprocessors. Dissertation, Universität des Saarlandes, Computer Science Department, Saarbrücken, 2001.

[KS86] Steven R. Kunkel and James E. Smith. Optimal pipelining in supercomputers. In Proceedings of the 13th Annual International Symposium on Computer Architecture, pages 404–411, June 1986.

[Lei99] Holger W. Leister. Quantitative Analysis of Precise Interrupt Mechanisms for Out-Of-Order Execution Processors. Dissertation, Universität des Saarlandes, Computer Science Department, Saarbrücken, 1999.

[Lei02] Dirk Leinenbach. Implementierung eines maschinell verifizierten Prozessors. Diplomarbeit, Universität des Saarlandes, Computer Science Department, Saarbrücken, 2002. (In German.)

[LS84] Johnny K. F. Lee and Alan J. Smith. Branch prediction strategies and branch target buffer design. IEEE Computer, pages 6–22, January 1984.

[McF93] Scott McFarling. Combining Branch Predictors. Technical Report TN-36, DEC Western Research Laboratory, June 1993.

[MLD+99] Silvia M. Müller, Holger Leister, Peter Dell, Nikolaus Gerteis, and Daniel Kröning. The impact of hardware scheduling mechanisms on the performance and cost of processor designs. In Proceedings of the 15th GI/ITG Conference 'Architektur von Rechensystemen' ARCS'99, pages 65–73. VDE Verlag, 1999.

[MP95] Silvia M. Müller and Wolfgang J. Paul. The Complexity of Simple Computer Architectures. Springer, Berlin; Heidelberg; New York, 1995.

[MP00] Silvia M. Müller and Wolfgang J. Paul. Computer Architecture, Complexity and Correctness. Springer, Berlin; Heidelberg; New York, 2000.

[PJS97] Subbarao Palacharla, Norman P. Jouppi, and James E. Smith. Complexity-effective superscalar processors. In Proceedings of the 24th Annual International Symposium on Computer Architecture, pages 206–218, 1997.

[SBG+02] Viji Srinivasan, David Brooks, Michael Gschwind, Pradip Bose, Victor Zyuban, Philip N. Strenski, and Philip G. Emma. Optimizing pipelines for power and performance. In Proceedings of the 35th International Symposium on Microarchitecture, pages 333–344, November 2002.

[SC02] Eric Sprangle and Doug Carmean. Increasing processor performance by implementing deeper pipelines. In Proceedings of the 29th Annual International Symposium on Computer Architecture, May 2002.

[Sei99] Peter-Michael Seidel. On the Design of IEEE Compliant Floating-Point Units and Their Quantitative Analysis. Dissertation, Universität des Saarlandes, Computer Science Department, Saarbrücken, 1999.

[Sei03] Peter-Michael Seidel. Delay of IEEE Compliant Floating Point Units. Personal Communication, March 2003.

[Sic92] James E. Sicolo. A multiported nonblocking cache for a superscalar uniprocessor. Master's thesis, University of Illinois, Department of Computer Science, Urbana IL, 1992.

[SP85] James E. Smith and Andrew R. Pleszkun. Implementation of precise interrupts in pipelined processors. In Proceedings of the 12th International Symposium on Computer Architecture, pages 36–44, June 1985.

[SPEC] Standard Performance Evaluation Corporation. The SPEC92 benchmark suite. http://www.specbench.org.

[SSH99] Ivan Sutherland, Bob Sproull, and David Harris. Logical Effort: Designing Fast CMOS Circuits. Morgan Kaufmann Publishers, 1999.

[Tom67] R.M. Tomasulo. An efficient algorithm for exploiting multiple arithmetic units. IBM Journal Research and Development, 11:25–33, January 1967.

[Web88] C. F. Webb. Subroutine call/return stack. Technical Report 30(11), IBM Technical Disclosure Bulletin, April 1988.

[WH04] Neil H. E. Weste and David Harris. CMOS VLSI Design: A Circuits and Systems Perspective. Addison-Wesley, 2004.

[Yeh93] Tse-Yu Yeh. Two-Level Adaptive Branch Prediction and Instruction Fetch Mechanisms for High Performance Superscalar Processors. PhD thesis, University of Michigan, Department of Electrical Engineering and Computer Science, 1993.