Branch Prediction for Network Processors
BRANCH PREDICTION FOR NETWORK PROCESSORS

by David Bermingham, B.Eng

Submitted in partial fulfilment of the requirements for the Degree of Doctor of Philosophy

Dublin City University
School of Electronic Engineering

Supervisors: Dr. Xiaojun Wang, Prof. Lingling Sun

September 2010

I hereby certify that this material, which I now submit for assessment on the programme of study leading to the award of Doctor of Philosophy is entirely my own work, that I have exercised reasonable care to ensure that the work is original, and does not to the best of my knowledge breach any law of copyright, and has not been taken from the work of others save and to the extent that such work has been cited and acknowledged within the text of my work.

Signed: David Bermingham (Candidate)
ID:
Date:

TABLE OF CONTENTS

Abstract
List of Figures
List of Tables
List of Acronyms
List of Peer-Reviewed Publications
Acknowledgements

1 Introduction
  1.1 Network Processors
  1.2 Trends Within Networks
    1.2.1 Bandwidth Growth
    1.2.2 Network Technologies
    1.2.3 Application and Service Demands
  1.3 Network Trends and Network Processors
    1.3.1 The Motivation for This Thesis
  1.4 Research Objectives
  1.5 Thesis Structure

2 Technical Background
  2.1 Overview
  2.2 Networks
    2.2.1 Network Protocols
    2.2.2 Network Technologies
    2.2.3 Router Architecture
  2.3 Network Processors
    2.3.1 Intel IXP-28XX Network Processor
    2.3.2 Cavium OCTEON Cn58XX
    2.3.3 State of the Art NP Architectures
  2.4 Network Processor Based Applications
    2.4.1 Packet Forwarding
    2.4.2 Quality of Service
    2.4.3 Security
    2.4.4 Payload Based Applications
  2.5 Scalability of Network Processors
    2.5.1 PE Parallelism
    2.5.2 Hardware Acceleration
    2.5.3 Technological Evolution
    2.5.4 PE Specific Techniques
    2.5.5 Summary of NP Scalability
  2.6 Micro-Architectural Considerations of PE Design
    2.6.1 Pipelining
    2.6.2 Pipelined Architecture
    2.6.3 Pipeline Hazards
    2.6.4 Control Hazards
  2.7 Techniques for Branch Prediction
    2.7.1 Static Branch Prediction
    2.7.2 Dynamic Branch Prediction
  2.8 Conclusions

3 Performance Evaluation Methods for Network Processors
  3.1 Overview
  3.2 Simulation and Modelling of NP Architectures
    3.2.1 Mathematical Models
    3.2.2 Architectural Simulators
  3.3 Performance Metrics for NP Architectures
    3.3.1 Prior Benchmarks and Analysis
    3.3.2 Branch Predictor Performance Evaluation
    3.3.3 PE Area Cost
    3.3.4 Area Cost of Branch Prediction
  3.4 Conclusions

4 A New Simulator for Network Processors
  4.1 Overview
  4.2 SimNP Simulator
    4.2.1 Software Architecture
    4.2.2 SimNP Processing Engines
    4.2.3 SimNP Memory Hierarchy
    4.2.4 SimNP Inter-Device Communication
    4.2.5 SimNP Interface Unit
    4.2.6 SimNP Hardware Acceleration
  4.3 Comparison with Existing Solutions
    4.3.1 Simulation Time
    4.3.2 Simulation Performance
    4.3.3 Workload Validation
  4.4 Conclusions

5 Analysis of NP Workloads
  5.1 Introduction
    5.1.1 Network Processing Complexity
  5.2 Workload Analysis
    5.2.1 Network Algorithms and Applications
    5.2.2 Simulation Parameters
  5.3 Simulation Results
    5.3.1 Instruction Distribution
    5.3.2 Instruction Budget
    5.3.3 Memory Distribution
    5.3.4 Parallelisation
  5.4 Conditional Branches within NP applications
    5.4.1 Static Branch Analysis of NP Applications
    5.4.2 Dynamic Branch Analysis of NP Applications
    5.4.3 Branch Penalty per Packet
  5.5 Conclusions

6 Branch Prediction in Process Engines
  6.1 Introduction
  6.2 Performance Evaluation of Existing Prediction Schemes
  6.3 Gshare Predictor Performance
  6.4 Performance Limitations of Dynamic Predictors
    6.4.1 Payload Applications
    6.4.2 Header Applications
    6.4.3 Summary of Predictor Performance
  6.5 Utilising Packet Flow Information during Branch Prediction
    6.5.1 Flow Information For Payload Applications
    6.5.2 Flow Information For Header Applications
    6.5.3 Indexing Branch History
    6.5.4 Field-Based Branch Predictor
  6.6 Performance Evaluation of a Proposed Predictor
    6.6.1 Latency of Field-Based Branch Predictor
    6.6.2 Chip Area of Field-Based Branch Predictor
    6.6.3 Utilisation of Field-Based Branch Predictor
    6.6.4 Performance Evaluation
  6.7 Conclusions

7 Conclusions & Future Work
  7.1 Motivation for Proposed Research – A Summary
  7.2 Summary of Thesis Contributions
    7.2.1 The SimNP NP Simulator
    7.2.2 Workload Analysis and Branch Behaviour of NP Applications
    7.2.3 Branch Prediction for Network Processors
  7.3 Future Work

Bibliography

ABSTRACT

Originally designed to favour flexibility over packet processing performance, the programmable network processor now faces the challenge of meeting increasing line rates while also providing additional processing capabilities. To meet these requirements, networking research has tended to focus on techniques such as offloading computation-intensive tasks to dedicated hardware logic or increasing parallelism. While parallelism retains flexibility, challenges such as load-balancing limit its scope. Hardware offloading, on the other hand, allows complex algorithms to be implemented at high speed but sacrifices flexibility. To this end, the work in this thesis focuses on a more fundamental aspect of a network processor: the data-plane processing engine. By performing both system modelling and analysis of packet processing functions, this thesis aims to identify and extract salient information regarding the performance of multi-processor workloads.
Following on from a traditional software-based analysis of programme workloads, we develop a method of modelling and analysing hardware accelerators when applied to network processors. Using this quantitative information, this thesis proposes an architecture which allows deeply pipelined micro-architectures to be implemented on the data-plane while reducing the branch penalty associated with these architectures.

LIST OF FIGURES

1.1 7 Layer OSI Network Model
1.2 Pipeline Depth Vs. Microprocessor Performance
2.1 Network Layer Topology
2.2 Example Router Line Card
2.3 Router Architecture
2.4 The Intel IXP 2805
2.5 Cavium Octeon Cn58XX
2.6 Classifying Router
2.7 Pipelining of a Combinational Circuit
2.8 5-Stage Integer Pipeline
2.9 Structural Hazard Within Pipeline Processor
2.10 Data Hazard Within Pipeline Processor
2.11 Branch Penalty Within Pipeline Processor
2.12 Sample 2-Bit State Transitions for Branch Predictor
2.13 2-Bit Predictor Table in SRAM
2.14 Global History Prediction Schemes
2.15 Per Address Dynamic Prediction Schemes