UNIVERSIDAD POLITÉCNICA DE MADRID

ESCUELA TÉCNICA SUPERIOR DE INGENIEROS DE TELECOMUNICACIÓN

HARDWARE ACCELERATION OF MONTE CARLO-BASED SIMULATIONS

TESIS DOCTORAL

PEDRO ECHEVERRÍA ARAMENDI

INGENIERO EN TELECOMUNICACIÓN

2011

DEPARTAMENTO DE INGENIERÍA ELECTRÓNICA

ESCUELA TÉCNICA SUPERIOR DE INGENIEROS DE TELECOMUNICACIÓN

UNIVERSIDAD POLITÉCNICA DE MADRID

Ph.D. THESIS

HARDWARE ACCELERATION OF MONTE CARLO-BASED SIMULATIONS

Author: Pedro Echeverría Aramendi, Telecommunication Engineer

Advisor: María Luisa López Vallejo, Profesor Titular del Dpto. de Ingeniería Electrónica, Universidad Politécnica de Madrid

2011

Ph.D. THESIS: Hardware Acceleration of Monte Carlo-Based Simulations

AUTHOR: Pedro Echeverría Aramendi

ADVISOR: María Luisa López Vallejo

El tribunal nombrado por el Mgfco. y Excmo. Sr. Rector de la Universidad Politécnica de Madrid el día 21 de Noviembre de 2011, para juzgar la Tesis arriba indicada, compuesto por los siguientes doctores:

PRESIDENTE: D. Carlos Alberto López Barrio

VOCALES: D. Javier Díaz Bruguera

D. Florent Dupont de Dinechin

D. Luis Entrena Arrontes

SECRETARIO: D. Carlos Carreras Vaquer

Realizado el acto de lectura y defensa de la Tesis el día 21 de Noviembre de 2011 en la E.T.S. de Ingenieros de Telecomunicación, acuerda otorgarle la calificación de:

El Secretario del tribunal

A mi familia

Contents

Abstract
Resumen
Acknowledgments
List of Figures
List of Tables

1 Introduction
  1.1 Motivation
    1.1.1 Acceleration Features of FPGAs
    1.1.2 Applications
    1.1.3 Designing with FPGAs. Challenges
  1.2 Objectives and Thesis Structure
    1.2.1 Monte Carlo Simulations and Target Application: LIBOR Market Model
    1.2.2 Objectives
    1.2.3 PhD Thesis Structure

2 Random Number Generation
  2.1 Random Number Generation: Overall Introduction
  2.2 Uniform Random Number Generation
    2.2.1 Linear Congruential Generators (LCG)
    2.2.2 Combined Generator Rand2
    2.2.3 Tausworthe Generators
    2.2.4 Mersenne Twister
  2.3 N(0,1) Gaussian Random Number Generation
    2.3.1 Generation methods
    2.3.2 Monte Carlo Implications and Hardware Implementation
    2.3.3 Inversion Method with Quintic Hermite Interpolation
  2.4 Variance Reduction Techniques
    2.4.1 Stratified Sampling and Latin Hypercube
  2.5 Developed Gaussian Random Number Generator
    2.5.1 Uniform Random Number Generator
    2.5.2 N(0,1) Gaussian Random Number Generator
    2.5.3 Stratified Sampling and Latin Hypercube
    2.5.4 Complete GRNG and SW-HW comparison
  2.6 Extending N(0,1) RNG
    2.6.1 Parameterisable RNG based on N(0,1) RNG
  2.7 Conclusions

3 Implementing Floating-Point Arithmetic in Configurable Logic
  3.1 Related Works
  3.2 Floating Point Format IEEE 754
    3.2.1 Format Complexity
  3.3 Floating-point Units for FPGAs. Adapting the Format and Standard Compliance
    3.3.1 Simplification of Denormalized Numbers
    3.3.2 Truncation Rounding
    3.3.3 Hardware Representation
    3.3.4 Global Approach Analysis
  3.4 Operators Architecture
    3.4.1 Adder/subtracter
    3.4.2 Multiplication
    3.4.3 Division
    3.4.4 Square Root
    3.4.5 Exponential and Logarithm Units
  3.5 Libraries Evaluation and Comparison
    3.5.1 Comparison with respect to a Commercial Library
    3.5.2 Operators Evaluation
    3.5.3 Replicability
  3.6 Towards Standard Compliance and Performance
    3.6.1 Simplification of Denormalized Numbers: One Bit Exponent Extension
    3.6.2 Truncation Rounding: Mantissa Extension
    3.6.3 FPGA-oriented floating-point library
  3.7 Conclusions

4 Exponentiation Operator
  4.1 Exponentiation function
    4.1.1 Related Work
  4.2 Range and error analysis
    4.2.1 Input-output range analysis
    4.2.2 General Error analysis
    4.2.3 Error Analysis for accurate x^y
  4.3 Variable precision implementation with FloPoCo
    4.3.1 Logarithm
    4.3.2 Multiplier
    4.3.3 Exponential
    4.3.4 Exceptions Unit
  4.4 Experimental Results
    4.4.1 Results Analysis
    4.4.2 Comparison with previous work
    4.4.3 Exceptions Unit
  4.5 Conclusions

5 LIBOR Market Model Hardware Core
  5.1 LIBOR Market Model
    5.1.1 LIBOR Market Model as base to compute financial products
  5.2 Model Analysis
    5.2.1 Variables' Range
    5.2.2 Simplifications to the model: Factorization
    5.2.3 Operators' complexity
    5.2.4 Model Summary
    5.2.5 Qualitative Profiling
    5.2.6 Data dependencies
  5.3 Adapting the model to Hardware
    5.3.1 Simulation order
    5.3.2 Tailored Arithmetic
  5.4 FPGA Monte Carlo Libor Market Model Engine
    5.4.1 General Architecture
    5.4.2 Gaussian RNG Core
    5.4.3 LMM Core
    5.4.4 Product Valuation Core
    5.4.5 Control Unit
  5.5 LMM Engine Implementation
    5.5.1 Operators' Features
    5.5.2 LMM Core. Precision-Accuracy-Performance
    5.5.3 Cores Implementation
  5.6 Conclusions

6 Hardware-Software Integration
  6.1 Hardware-Software Partitioning
    6.1.1 Tasks Stability Characteristics
    6.1.2 Communication overheads
    6.1.3 Achieve maximum possible acceleration
    6.1.4 Partitioning Policy
  6.2 System Architecture and Communications
    6.2.1 Why PCI-Express?
    6.2.2 Communications Model
    6.2.3 Communications Requirements
  6.3 PCI Express Core
    6.3.1 Within FPGA Communications
  6.4 Software
    6.4.1 Driver and Low Level Functions
    6.4.2 Application modification
  6.5 Experimental Results
    6.5.1 Complete Accelerator Implementation Results
    6.5.2 Software Profiling
    6.5.3 Hardware-Software Solution Results
  6.6 Conclusions

7 Conclusions
  7.1 Contributions and Conclusions of this Thesis
    7.1.1 Random Number Generators
    7.1.2 Floating-Point Arithmetic Operators and FPGAs
    7.1.3 LMM Hardware Accelerator
    7.1.4 Capacity and Performance of FPGAs. Accelerator design
    7.1.5 Hardware-Software co-design and Integration
  7.2 Future Lines of Work
    7.2.1 Research lines related to Improvements
    7.2.2 New Research Lines

Bibliography

Abstract

During the last years there has been an enormous advance in FPGAs. Traditionally, FPGAs have been used mainly for prototyping as they offer significant advantages at a suitably low cost: flexibility and ease of verification. Their flexibility allows the implementation of different generations of a given application and gives designers room to modify implementations until the very last moment, or even to correct mistakes once the product has been released. Second, the verification of a design mapped onto an FPGA is easier and simpler than in ASICs, which require a huge verification effort. In addition to these advantages, technological advances have added great capability and performance to FPGAs, and even though FPGAs are not as efficient as ASICs in terms of performance, area or power, nowadays they can provide better performance than standard or digital signal processor (DSP) based systems. This fact, in conjunction with the enormous logic capacity allowed by today's technologies, makes FPGAs an attractive choice for the implementation of complex digital systems. Furthermore, with their newly acquired digital signal processing capabilities, FPGAs are now expanding their traditional prototyping role to help offload computationally intensive functions from standard processors.

This Thesis is focused on the last point, the use of FPGAs to accelerate computationally intensive applications. The use of FPGAs for hardware acceleration is an active research field. However, there are still several challenges concerning the use of FPGAs as accelerators:

• Availability of Cores.
• Capability and performance of FPGAs.
• Methods, algorithms and techniques suited for FPGAs.

• Design tools.
• Hardware-Software co-design and integration.

Studying in depth each one of these five challenges related to hardware acceleration is not feasible in just one Thesis. The great variety of applications that can be accelerated and the different features among them imply that the complexity of each task is high. Therefore, in this Thesis we have chosen one subset of applications to be studied, dealing with the implementation of a real application of this subset.

Selecting a complex subset of applications, in our case Monte Carlo simulations, allows us to make a general analysis of the main topic, hardware acceleration, from the study, analysis and design of a particular application, since this subset shares several features with many other applications. Specifically, we have selected a financial application, the Monte Carlo based LIBOR Market Model.

Developing an FPGA application from scratch is almost impossible, and the availability of cores is a must to shorten development time. Following this idea, one of the main objectives is to study the common elements that play a key role in Monte Carlo simulations and in our target application (and shared with many other applications). Two common elements stand out:

• The random number generators that are required for the underlying random variables.
• Floating-point operators, which are the base elements for implementing the mathematical models that are evaluated.

In this way, the first objective of this Ph.D. Thesis is the study, design and implementation of random number generators. In particular, we have focused on Gaussian random number generation and the implementation of a complete generator compatible with variance reduction techniques that can be used for our target application and for other applications.

In this field we have developed a high-quality, high-performance Gaussian random number generator which is parameterizable and compatible with the also developed parameterizable Latin Hypercube core and a high-performance Mersenne Twister generator. Research results in this field demonstrate that random number generation is ideal for hardware acceleration, as an isolated core or within bigger accelerators.

Meanwhile, the second objective has dealt with the implementation of efficient and FPGA-oriented mathematical operators (both basic and complex and using floating-point arithmetic). We focused on the design, development and characterization of libraries of components. Instead of focusing on the algorithms of the operators, our approach has been to study how the format can be simplified to obtain operators that are better suited for FPGAs and present better performance. One important goal pursued here was to achieve libraries of general-purpose components that can be reused in several applications and not just in a particular target application.

Different design decisions have been studied and analyzed, and from this analysis the impact of the overhead due to some of the floating-point standard features has been determined. The format overhead implies a greater use of resources, and reducing it is a must to obtain operators that are better suited for FPGAs and present better performance, independently of the underlying calculation algorithm. In particular, the handling of denormalized numbers has a major impact on FPGA operators. Following the results obtained in that study, we have discussed and selected a set of features that implies improved performance and reduced resource usage. This set has been chosen to design two additional FPGA-oriented hardware libraries that ensure (or even improve) the accuracy and resolution given by the standard. The operators of these libraries are the base components for the implementation of the target application.

Additionally, a second analysis has been carried out to study the capability of FPGAs to implement complex datapaths. This analysis shows the huge capacity of current FPGAs, which allows up to hundreds of single-precision floating-point operators. Despite this capacity, this second analysis has also demonstrated how the working frequency of the operators is severely affected by the routing of their elements when the operators are not isolated and a high percentage of the resources of an FPGA is used.

Related to the target application, a third objective of this work was to go deeper into the implementation of a particular operator, the exponentiation function. This operator is required in many scientific and financial simulations. Its complexity and the lack of previous general-purpose implementations have deserved special attention. We have developed and presented an accurate exponentiation operator for FPGAs based on the straightforward translation of x^y into a chain of sub-operators and on the FPGA flexibility which allows tailored precisions. Taking advantage of this flexibility, the provided error analysis focused on determining which precisions are needed in the partial results and in the internal architectures of the sub-operators to obtain an accurate operator with a maximum error of one ulp. Finally, the integration of this error analysis and the development of the operator within the FloPoCo project have made it possible to automate the generation of exponentiation operators with variable precisions.

The next objective we tackled was related to the global purpose of the Thesis: validating all the previously developed elements through the implementation of a complex Monte Carlo simulation which involves all the features that can be found in Monte Carlo simulations. In this way, we have dealt with the implementation of the target application, the LIBOR Market Model (LMM). Special attention was devoted to all the features, requirements and circumstances that affect the performance of the accelerator. A complete LMM hardware core has been developed and its results validated against the original software implementation. Three main features were analyzed:

• Correctness of the results obtained.

• Accuracy.
• Speedup factors obtained by the global application and by each of the main components.

Finally, the last objective was the integration of the hardware accelerator within the original software application. All issues related to the communication mechanism are studied, putting special focus on how performance is affected by data transfers and by the hardware-software partitioning policy implemented.

Following the partitioning policy selected, we have developed the infrastructure (both hardware and software) required to make possible the integration of our accelerator within a software application. A mechanism, based on the use of two RAM memory zones and a PCI-E core with Bus Master capabilities in the FPGA, has been proposed and implemented, and it has allowed us to extend the intrinsic parallelism of Monte Carlo simulations to the way the CPU and the FPGA work together. In this way, we exploit the CPU to work in parallel with the FPGA, overlapping their execution times. Hence, the software execution time affecting the performance is reduced to the initial and final processing and to the product valuation in case it is slower than the LMM plus the random number generator in the FPGA. With this scheme we have achieved high speedups, around 18 times, close to the theoretical limit for our case, in which all the software either has been ported to hardware or has its execution overlapped with the FPGA execution (the achievable speedup of the LMM plus the RNG). This speedup could be considerably improved using newer FPGAs and several LMM cores in parallel.

Resumen

Durante los últimos años ha habido un enorme avance en la tecnología y capacidades de las FPGAs. Tradicionalmente, las FPGAs se han utilizado principalmente para el desarrollo de prototipos, ya que ofrecen importantes ventajas a un bajo coste: flexibilidad y facilidad de verificación. Su flexibilidad permite la implementación de las diferentes versiones de una aplicación determinada y permite a los diseñadores modificar las implementaciones hasta el último momento, o incluso corregir errores una vez que el producto está siendo utilizado. En segundo lugar, la verificación de un diseño en una FPGA es más fácil y más sencilla que en un ASIC, donde se requiere un esfuerzo de verificación enorme. Además de estas ventajas, los avances tecnológicos han permitido FPGAs con grandes capacidades a la vez que se ha aumentado su rendimiento. Y aunque las FPGAs no sean tan eficientes como los ASIC en términos de rendimiento, recursos o consumo de potencia, hoy en día pueden ofrecer un mejor rendimiento que un sistema estándar o que uno basado en procesadores digitales de señal (DSP). Esto, junto con la enorme capacidad de recursos lógicos alcanzada por las tecnologías de hoy, hace de las FPGAs una opción atractiva para la implementación de sistemas digitales complejos. Además, con su recientemente adquirida capacidad de procesamiento digital de señal, las FPGAs están ampliando su rol tradicional de prototipado al rol de coprocesador para descargar de cálculos intensivos a los procesadores estándar.

Esta tesis se centra en el último punto, el uso de FPGAs para acelerar las aplicaciones computacionalmente intensivas. El uso de FPGAs para la aceleración de hardware es un área activa de investigación. Sin embargo, todavía hay varios desafíos relativos al uso de FPGAs como aceleradores:

• Disponibilidad de cores de implementación.

• Capacidad y rendimiento de las FPGAs.
• Necesidad de métodos, algoritmos y técnicas adecuadas para FPGAs.
• Herramientas de diseño.
• Co-diseño de Hardware-Software y su integración.

El estudio detallado de cada uno de estos cinco desafíos relacionados con la aceleración de hardware no es factible en tan sólo una tesis. La gran variedad de aplicaciones que pueden ser aceleradas y las diferentes características entre ellas implican que la complejidad de cada tarea es alta. Por lo tanto, en esta tesis se ha elegido un conjunto de aplicaciones a estudiar, y se ha llevado a cabo la implementación de una aplicación real de este subgrupo.

La selección de un subconjunto de aplicaciones complejas, en nuestro caso las simulaciones Monte Carlo, nos permite hacer un análisis general de la aceleración de hardware, nuestro campo principal, desde el estudio, análisis y diseño de una aplicación en particular, ya que este conjunto de aplicaciones comparte varias características con muchas otras aplicaciones. En concreto, hemos seleccionado una aplicación financiera, la simulación del LIBOR Market Model basada en Monte Carlo.

El desarrollo de aplicaciones en FPGAs a partir de cero es casi imposible y la disponibilidad de los cores es una necesidad para acortar el tiempo de desarrollo. Siguiendo esta idea, uno de nuestros principales objetivos es el estudio de los elementos comunes que juegan un papel clave en las simulaciones de Monte Carlo y en la aplicación seleccionada (y compartidos con muchas otras aplicaciones). Dos elementos comunes han sido destacados:

• Los generadores de números aleatorios que se requieren para las variables aleatorias subyacentes.
• Los operadores de punto flotante, que son los elementos base para implementar los modelos matemáticos que se evalúan.

De esta manera, el primer objetivo de esta Tesis es el estudio, diseño e implementación de generadores de números aleatorios. En particular, nos hemos centrado en la generación de números aleatorios con distribución Gaussiana y en la implementación de un generador completo y compatible con técnicas de reducción de varianza que se utilizan en la aplicación seleccionada y en otras aplicaciones.

En este campo de investigación hemos desarrollado un generador de números aleatorios gaussianos de alta calidad y alto rendimiento. A su vez, este generador es parametrizable y compatible con el módulo parametrizable de hipercubo latino también desarrollado y con un generador Mersenne Twister de alto rendimiento. Los resultados de investigación en este campo demuestran que la generación de números aleatorios es idónea para la aceleración de hardware, tanto como núcleo aislado como integrada en aceleradores mayores.

El segundo objetivo se ha ocupado del desarrollo de operadores matemáticos eficientes y orientados a FPGAs (tanto básicos como complejos y con aritmética de punto flotante). Nos hemos centrado en el diseño, desarrollo y caracterización de las librerías de componentes. En lugar de centrarnos en los algoritmos de los operadores, nuestro enfoque ha sido el de estudiar cómo el formato se puede simplificar para obtener operadores más adecuados para FPGAs y que a su vez presenten un mejor rendimiento. Un objetivo importante aquí buscado ha sido lograr librerías de componentes de propósito general que puedan ser reutilizadas en varias aplicaciones y no sólo en la aplicación seleccionada en esta tesis.

Diferentes decisiones de diseño se han estudiado y analizado. De este análisis, hemos determinado el impacto de la sobrecarga debida a algunas de las características del estándar de punto flotante. Las sobrecargas que presenta este formato implican un mayor uso de los recursos y su reducción es una necesidad para obtener operadores más adecuados para FPGAs y con mejor rendimiento, independientemente del algoritmo de cálculo subyacente. En particular, el manejo de los números denormalizados tiene un gran impacto en los operadores de FPGA. Con los resultados obtenidos en ese estudio, hemos analizado y seleccionado un conjunto de características que implican un mejor rendimiento y una reducción de los recursos. Este conjunto ha sido elegido para diseñar dos librerías adicionales para FPGA orientadas a garantizar (o incluso mejorar) la precisión y la resolución dada por el estándar. Los operadores de estas librerías son los componentes básicos para la implementación de la aplicación seleccionada.

Además, un segundo análisis se ha llevado a cabo para estudiar las capacidades de las FPGAs para implementar arquitecturas de datos complejas. Este análisis muestra las enormes capacidades de las FPGAs actuales, que permiten la implementación de cientos de operadores de punto flotante en la misma FPGA. A pesar de esta capacidad, este segundo análisis también demuestra cómo la frecuencia de trabajo de los operadores se ve gravemente afectada por el interconexionado de sus elementos cuando los operadores no están aislados y se está utilizando un alto porcentaje de los recursos de la FPGA.

Relacionado con la aplicación de destino, un tercer objetivo de este trabajo ha sido profundizar en la implementación de un operador en particular, la función de exponenciación. Este operador es utilizado en muchas simulaciones científicas y financieras. Su complejidad y la falta de implementaciones previas de propósito general han merecido una atención especial. Hemos desarrollado y presentado un operador de exponenciación exacto para FPGAs basado en la traducción directa de x^y en una cadena de sub-operadores y en la flexibilidad de las FPGAs, que permite precisiones a medida. Aprovechando esta flexibilidad, el análisis de error se centró en determinar qué precisiones son necesarias en los resultados parciales y en la arquitectura interna de los sub-operadores para obtener un operador exacto con un error máximo de un ulp. Por último, la integración de este análisis de error y el desarrollo del operador en el proyecto FloPoCo han permitido automatizar la generación de los operadores de exponenciación con precisiones variables.

El siguiente objetivo ha sido abordar, en relación con el objetivo global de la Tesis, la validación de todos los elementos desarrollados anteriormente con la implementación de un modelo complejo de simulación de Monte Carlo que incluye todas las características que se pueden encontrar en este tipo de simulaciones. De esta manera, abordamos la implementación de la aplicación seleccionada, el LIBOR Market Model (LMM). Se prestó especial atención a todas las características, requisitos y circunstancias que afectan al rendimiento del acelerador. Un core completo del LMM ha sido desarrollado en hardware y validado contra los resultados del software original. Tres características principales se han analizado:

• La exactitud de los resultados obtenidos.
• La precisión necesaria para el hardware.
• Los factores de aceleración obtenidos por la aplicación global y por cada uno de los componentes principales.

Finalmente, el último objetivo ha sido la integración del acelerador hardware con la aplicación de software original. Todas las cuestiones relacionadas con los mecanismos de comunicación se han estudiado poniendo especial énfasis en cómo el rendimiento se ve afectado por las transferencias de datos y por la política de particionamiento hardware-software implementada.

Siguiendo la política de particionamiento seleccionada, hemos desarrollado la infraestructura (hardware y software) necesaria para hacer posible la integración de nuestro acelerador dentro de una aplicación de software. Un mecanismo, basado en el uso de dos zonas de memoria RAM y un core PCI-E con capacidad bus master en la FPGA, se ha propuesto e implementado, y nos ha permitido extender el paralelismo intrínseco de las simulaciones de Monte Carlo a la forma en que la CPU y la FPGA trabajan juntas. De esta manera, se aprovecha la CPU para trabajar en paralelo con la FPGA, superponiendo sus tiempos de ejecución. Por lo tanto, el tiempo de ejecución de software que afecta al rendimiento se reduce al tratamiento inicial y final y a la valoración del producto en caso de que ésta sea más lenta que el LMM más el generador de números aleatorios en la FPGA. Con este esquema hemos logrado incrementos de velocidad altos, de alrededor de 18 veces, muy cerca del límite teórico para nuestro caso: cuando no hay software que no haya sido portado al hardware o cuya ejecución no se superponga con la ejecución de la FPGA (la máxima aceleración alcanzable teniendo en cuenta sólo el LMM más la generación de números aleatorios). En este caso, la aceleración lograda podría ser considerablemente mejorada con FPGAs más nuevas y con varios núcleos de LMM en paralelo.

Acknowledgments

I would like to thank Marisa, who has been guiding my research since my final degree project and working with me since then. This work is also hers. Thanks for giving me the opportunity of working on this Thesis, for all her advice and contributions that have enriched this Thesis, and for these years of collaboration and friendship. Thanks also to Carlos López Barrio for his support, for his advice and for sharing his experience with me. I would also like to thank all the people from the LSI research group for all these years of sharing good moments, meals and conversations. Special thanks to Miguel Angel and Pablo, who have collaborated with me in some research fields and have become good friends. I would also like to thank my friend Paco, who has helped me in the development of the driver and shared with me many conversations about this Thesis and its research. And Florent, without whom chapter four would not have been possible. Thanks to Pedro and Rocio, who have shared many hours and conversations with me at the office. I would also like to acknowledge BBVA and the New Products Department for funding and supporting this Thesis through the project P060920579. Especially, I would like to thank Miguel Ángel, Javier, Manuel, José María and Antonio. To my parents and family, this book is for them. And finally, thanks to my wife Marta, for all her support, love and patience during these years.


List of Figures

1.1 Thesis Structure

2.1 Plot of 1/√m
2.2 Inversion Method
2.3 Dimension Impact: Stratified Sampling & Latin Hypercube
2.4 Mersenne Twister General FPGA Architecture
2.5 MT work area Storage
2.6 UNU.RAN segmentation
2.7 Inversion N(0,1) RNG architecture
2.8 Search unit
2.9 Variance reduction general hardware architecture
2.10 Stratified Sampling Control Unit
2.11 Latin Hypercube Control Unit
2.12 Stratified Sampling and Latin Hypercube results
2.13 Inversion based GRNG with Variance Reduction technique
2.14 Parameterisable RNG N(µ,σ)-LogN(µ,σ) RNG

3.1 Floating-Point word

3.2 Floating-Point Operator
3.3 Adder-Subtracter
3.4 Multiplier Architecture
3.5 Division step
3.6 Square Root
3.7 Exponential function unit
3.8 Logarithm function unit
3.9 Operators Evaluation
3.10 Slices and Stages per type for each library
3.11 Operators Replicability (HP Library)
3.12 Synthetic Datapath
3.13 Adder Replicability Results
3.14 Divider Replicability Results
3.15 Multiplier Replicability Results
3.16 Logarithm Replicability Results
3.17 Towards Standard Operators Evaluation

4.1 Simplified overview of a power function unit
4.2 Power function architecture

5.1 LIBOR Forward Rates
5.2 Monte Carlo LIBORs simulation
5.3 LMM Monte Carlo Simulation with Latin Hypercube
5.4 Engine Architecture
5.5 LMM Core unit
5.6 Parallel Correlation
5.7 Sequential Correlation
5.8 Drift Calculation
5.9 LIBOR Calculation
5.10 LMM datapath
5.11 Product Valuation Core
5.12 Architecture for LMM accuracy measurement

5.13 SW-HW difference in average
5.14 Maximum SW-HW difference
5.15 SW-HW difference in the last time step

6.1 Integration Architecture
6.2 Communications Flow
6.3 Dataflows
6.4 PCI-Express Core
6.5 PCI-Express & Accelerator Interface
6.6 Detailed-view Software-Hardware modified dataflow


List of Tables

1.1 Speedups obtained in different case studies

2.1 Gaussian Generation Methods-Selection Criteria
2.2 Gaussian ICDF Implementation Requirements
2.3 Virtex-4 XC4VFX140-11. Table of resources
2.4 URNG Implementation Results
2.5 Accuracy and Segmentation
2.6 Maximum segment size - Number of segments (searched 21 mantissa bits of accuracy)
2.7 FPGA N(0,1) RNG results
2.8 Variance Reduction implementation Results
2.9 Complete N(0,1) RNG results
2.10 Hardware-Software Comparison
2.11 Parameterisable RNG N(µ,σ)-LogN(µ,σ) RNG results

3.1 Floating-Point Operators Libraries. Four Basic Operators
3.2 Types of floating-point numbers
3.3 Logic Reduction due to Design Simplifications
3.4 Operators Results. Commercial Library

3.5 Operators Results
3.6 Slices type of logic per operator
3.7 Split Slices comparison
3.8 Split Pipeline Stages comparison
3.9 FPGA Resources
3.10 Operators results with the final proposed features
3.11 Required Interfaces

4.1 Exception handling for the exponentiation function in the IEEE-754 standard
4.2 Sub-operators Range Analysis
4.3 Powering function Relative Error (ulp)
4.4 Synthesis results for Virtex-4 (4vfx100ff1152-12) for pow function
4.5 Separate synthesis results for the sub-component (targeting 200MHz)
4.6 Separate synthesis results for Exception unit
4.7 Exception's control results

5.1 Parameters Range
5.2 Implementation Results for the Modified operators
5.3 Implementation Results of the LMM Engine for V5-FX200

6.1 Types of Data Transfers
6.2 Implementation results of the complete accelerator for a V5-FX200
6.3 Test Simulation Features
6.4 Software Profiling
6.5 Main variables to compute per subpath
6.6 Extrapolated Profile (5000 grouped paths)
6.7 LMM, GRNG and LMM+RNG Speedups
6.8 Extrapolated achievable speedup
6.9 Hardware-Software Profiling
6.10 Extrapolated times and measured speedup
6.11 Resources of the Different Xilinx FPGA Families
6.12 Advanced FPGAs extrapolation

1 Introduction

Since the first digital computers were designed and developed, one of the main necessities that have boosted research in science has been the need for higher performance. There has always been a huge number of applications pushing the limits of computer technologies and demanding more performance: scenarios with real-time requirements or excessively long execution times which make the application unfeasible. On the other side, never-ending improvements in silicon technologies keep bringing higher and higher performance: deep nanometer technologies (currently 20 nm is almost available [Ete11] and 14 nm is expected in 2012) which allow the integration of billions of transistors in a single die, different types of logic core devices with different oxide thicknesses and threshold voltages to meet the requirements of high-performance, low-standby-power, or low-operating-power circuit applications, etc.

In addition to those continuous improvements, designers have relied on solutions based on special architectures to accelerate the performance of these applications, with processing units exploiting their common features such as parallelism, repetitive tasks or intensive mathematical processing. Traditionally, these solutions have been of two types:

• Parallel processing computers with parallel processors and/or datapaths.
• Dedicated hardware, accelerators, specialized in computing one type of processing task which complements conventional architectures.

However, this situation has changed. Computers based on conventional architectures and processors have always been much more competitive in price than parallel processing computers. Therefore, as the technology to interconnect different computers to work like just one big supercomputer has been continuously improving, parallel processing computers have been gradually replaced by clusters of conventional computers and multicore processors. Furthermore, as power consumption has become a main concern in recent years [KAB+03], parallelism has become the dominant paradigm in computer architecture and has been incorporated into conventional architectures, mainly in the form of the above mentioned multicore processors. In this new scenario dominated by conventional multicore computers and clusters built with them, acceleration continues to be a great necessity due to the following reasons:

• There are applications that cannot be accelerated by a cluster or a multicore architecture:
  – Single-thread applications.
  – Embedded systems.
  – Single computer environments.
• Energy and space required by large clusters are key limiting factors.
• Multicore processors are based on general purpose cores. Consequently, they are not optimal for all types of computations.

Therefore, while the use of specific parallel processing computers has declined, new solutions continue to appear in the field of hardware accelerators. In particular, the use of complementary hardware accelerators is blossoming and becoming more and more important. Focusing on those problems, acceleration can be provided using different technologies, mainly three of them:

• ASICs, Application Specific Integrated Circuits.
• GPUs, Graphics Processing Units.
• FPGAs, Field Programmable Gate Arrays.

ASICs share the same technology as general purpose microprocessors but they are specifically designed for a particular application and are not controlled by software. Therefore, they can be the most efficient technology for any computational task. However, their use for acceleration purposes is very limited. On the one hand, they do not offer any flexibility, as the task that they perform cannot be modified. On the other hand, the high cost of designing and manufacturing any integrated circuit restricts their use to applications where millions of circuits can be sold.

In the last years, GPUs, which are already accelerators for personal computers that handle all graphics processing, have started to be used to accelerate applications with similar characteristics to graphics processing due to the significant performance they have achieved. Furthermore, the addition of programmable stages in the GPU datapath has allowed generalizing the use of GPUs, leading to general-purpose computing on graphics processing units, GPGPU [RHS+08]. However, applications with complex feedback loops and control or extensive bit handling are not suitable for GPU implementation. Meanwhile, the high power consumption of GPUs restricts their use in certain environments.

As with GPUs, during the last years there has been an enormous advance in FPGAs. Traditionally, FPGAs have been used mainly for prototyping as they offer significant advantages at a suitably low cost [Hau98]: flexibility and ease of verification. Their flexibility allows the implementation of different generations of a given application and provides space to designers to modify implementations until the very last moment, or even correct mistakes once the product has been released. Second, the verification of a design mapped into an FPGA is easier and simpler than in ASICs, which require a huge verification effort.

In addition to these advantages, the technological advances have added great capabilities and performance to FPGAs, and even though FPGAs are not as efficient as ASICs in terms of performance, area or power, it is true that nowadays they can provide better performance than standard or digital signal processor (DSP) based systems. This fact, in conjunction with the enormous logic capacity allowed by today's technologies, makes FPGAs an attractive choice for the implementation of complex digital systems. Furthermore, with their newly acquired digital signal processing capabilities [Xilf], FPGAs are now expanding their traditional prototyping roles to help offload computationally intensive functions from standard processors.

1.1. Motivation

This Thesis is focused on the last point, the use of FPGAs to accelerate computationally intensive applications. Current deep sub-micron technologies allow manufacturing FPGAs with extraordinary logic density and speed. The initial challenges related to FPGA programmability and large interconnection capacitances (poor performance, low logic density and high power dissipation) have been overcome [KR07] while providing attractive low cost and flexibility.

Additionally, nowadays FPGAs provide not only configurable logic. They also provide powerful DSP units (mainly for multiplication and accumulation operations) and embedded RAM memory [Altb, Xild], while high speed elements are also provided for interconnections. Subsequently, the use of FPGAs in the implementation of complex applications, see Section 1.1.2, is increasingly common thanks to the set of features that make FPGAs a powerful alternative for acceleration, see Section 1.1.1. However, several challenges and concerns still remain related to hardware acceleration with FPGAs, Section 1.1.3.

1.1.1. Acceleration Features of FPGAs

Even though the clock frequencies that can be achieved with an FPGA are low when compared to a high-end microprocessor (roughly ten times slower), accelerating applications with FPGAs can potentially deliver enormous performance [HVG+07a] due to the intrinsic nature of the FPGA's architecture, which allows:

• Parallel architectures.
• Cascaded datapaths.
• Deeply pipelined architectures.

When designing an accelerator with an FPGA, two levels of parallelism can be achieved. Firstly, multi-thread parallelism, as the application datapath can be replicated in parallel as many times as possible, the number of resources of the FPGA being the only a priori limit. Secondly, the datapath itself can be a parallel datapath, executing several instructions and operations in parallel, taking into account only data dependencies. Furthermore, datapaths in an FPGA can also be implemented as a complete chain, as all overhead instructions needed in software related to indexes, memory access, conditional loops, etc. can be done in parallel and in advance. Thereby, control instructions and instructions related to moving data do not affect the datapath. Finally, the FPGA's combination of combinational and register logic allows deeply pipelined architectures that process new data every clock cycle, with the only limitations implied by data dependencies or communications.
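To make these two levels of parallelism concrete, the sketch below (illustrative only, not code from this Thesis) shows the typical structure of a Monte Carlo kernel. The outer loop iterations are mutually independent, so the whole loop body can be replicated as parallel datapaths, while the arithmetic chain inside the body maps to a deep pipeline that accepts one new sample per clock cycle; the index and control bookkeeping that costs instructions in software is handled in parallel by the hardware control logic.

    #include <cmath>
    #include <random>

    // Illustrative Monte Carlo kernel structure (a sketch, not code from the Thesis).
    // Outer loop: independent replications -> candidates for datapath replication
    //             (in hardware, each replicated datapath would use its own RNG).
    // Inner loop: chained arithmetic       -> candidate for a deep pipeline.
    double simulate(long n_paths, long n_steps) {
        std::mt19937 gen(123);
        std::normal_distribution<double> gauss(0.0, 1.0);
        double acc = 0.0;
        for (long p = 0; p < n_paths; ++p) {       // independent paths
            double x = 1.0;
            for (long t = 0; t < n_steps; ++t)     // arithmetic chain per step
                x *= std::exp(0.01 * gauss(gen) - 5e-5);
            acc += x;                              // accumulate the per-path result
        }
        return acc / n_paths;                      // Monte Carlo average
    }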

1.1.2. Applications

Computationally intensive applications can be found in almost every field where a computer is used. Nevertheless, the advances in computer technologies have provided solutions to a great number of these applications, making their execution possible with up-to-date conventional computers. However, there exist other complex applications requiring higher performance than the one provided by conventional computers. Additionally, new more complex applications continue to appear. These applications can be classified into three interrelated groups:

• High Performance Computing.
• Real time applications.
• Intensive applications in embedded systems.

The range of applications where FPGAs can be used as accelerators includes any application where some degree of parallelism can be exploited or where a large datapath is required. Related research fields include bioinformatics [EH09], where FPGAs are used for DNA sequencing and dot plotting, or molecular dynamics simulations [SGTP08, CHL08, AAS+07], where the simulation of the motion models for the time evolution of a set of interacting particles is accelerated. Another outstanding research field is the processing of medical images, where real-time features are desirable and can be achieved with FPGAs [GMDH08, DCPS07]. FPGA acceleration also includes fields such as financial simulation [Fel11], geophysical simulation for oil research, aerodynamics research, fluid dynamics, etc. Another interesting research field is the implementation of basic subsystems [ZP08]. While they are not applications by themselves, these linear algebra subsystems are computationally intensive routines common to many applications.

1.1.2.1. Monte Carlo Applications

Finally, one subset of applications stands out: Monte Carlo simulations. Monte Carlo acceleration with FPGAs has been an active research field in recent years, as the intrinsic nature of the Monte Carlo approach can be exploited by FPGA properties. On the one hand, Monte Carlo models repeat thousands of times the same calculations, only varying the values of the underlying random variables. Therefore, they fit perfectly with the FPGA's parallelism capability of replicating the datapath many times. On the other hand, Monte Carlo datapaths are mainly composed of mathematical operations with a small load of control instructions, and therefore are very well suited for exploiting datapath capabilities of deeply pipelined chained datapaths.

This is the case of physical simulations like [PF06], where an FPGA has been used to approach a real-time solution for a radiation transportation problem, [LRL+09], where Monte Carlo is used to calculate light propagation in tissues for medical photodynamic therapy, or radiotherapy treatment planning [FFM+10]. Another remarkable field is financial simulation [ZLHea05, KCL08, MA07, WV08, TTTL10], where the pricing calculation of different financial products is accelerated. Furthermore, circuit design can take advantage of Monte Carlo simulations for timing analysis methods in VLSI design [YTOS11].

In Table 1.1 the speedup factors obtained using FPGAs are summarized for some of these applications, with a maximum speedup of 650. Additionally, not only is simulation time significantly reduced when FPGAs are used; energy consumption is also decreased. This feature is not usually measured; however, some results can be found in the literature where energy consumption is reduced by a factor of 45 [LRL+09].

Table 1.1: Speedups obtained in different case studies

Reference   Field                   Speedup (×)
[EH09]      Bioinformatics          45
[PF06]      Physical Monte Carlo    650
[LRL+09]    Physical Monte Carlo    80
[ZLHea05]   Financial Monte Carlo   26
[KCL08]     Financial Monte Carlo   63
[WV08]      Financial Monte Carlo   50
[TTTL10]    Financial Monte Carlo   24

1.1.3. Designing with FPGAs. Challenges

As just seen, the use of FPGAs for hardware acceleration is an active research field. However, there are still several challenges concerning the use of FPGAs as accelerators. We have identified the following key challenges:

• Availability of Cores.
• Capability and performance of FPGAs.
• Methods, algorithms and techniques suited for FPGAs.
• Design tools.
• Hardware-Software co-design and integration.

1.1.3.1. Availability of Cores

Designing complex applications from scratch with FPGAs makes the design cycle extremely long. It is necessary to analyze the application to design the architecture required for the datapath with the target of achieving the highest possible performance. Furthermore, it must be analyzed how to integrate the control requirements into that datapath. Finally, it is necessary to develop all the required basic elements, such as the RNG or the arithmetic operators, which implies studying the type of arithmetic to be used and the resolution and precision of the numbers (data representation).

Thus, the availability of complete and fully characterized basic elements (i.e. operators) targeting FPGAs has become essential to shorten the time needed to design an application while making its design easier. In this way, when designing any application, it is essential for the designer to have at their disposal libraries of mathematical operators and other components, and this topic, the analysis and design of mathematical operators, is one of the foci of this Thesis. Other components, like communication cores, also play a key role in the development of any application because the hardware accelerator must interact with the host system. Hence, this topic will also be analyzed in this Thesis.

1.1.3.2. Capability and performance of FPGAs

FPGA resources are mainly programmable logic plus interconnections between the logic. Once a design is completed, it is mapped and routed into these logic elements to configure the FPGA with the design functionality. In this way, the results obtained for an application can vary depending on which logic the application requires and how this logic is used (programmed). Additionally, other factors can determine the performance of a design:

• Routing easiness of the design (related to the percentage of the FPGA being used).
• Use of embedded elements.
• Code quality.
• Hard dependencies in the logic.

These issues affect both the capability and performance of FPGAs. With respect to the first issue, as nowadays FPGAs have a huge amount of resources, it usually does not represent a big technical problem, but it can be of great concern with respect to economic issues due to the high cost of the largest FPGAs [CA07].

With respect to the performance that can be achieved, it cannot be exactly predicted before the design is completed, making it difficult to determine whether an application is suited for FPGAs or not. However, if the design only requires characterized cores or well-known structures, the performance can be inferred through the expected clock frequency and throughput.

1.1.3.3. Methods, algorithms and techniques suited for FPGAs

The FPGA's unique combination of features and resources makes it necessary to reevaluate the methods, algorithms and techniques used for computing to decide which ones are the most suited for FPGAs [HVG+07b]. FPGAs offer the designer total flexibility for the selection of signal bit-widths, the arithmetic used, the operations or tasks carried out, the design of the datapaths, etc. Furthermore, the inherent parallel nature of FPGA architectures opens up the design possibilities, allowing chained datapaths with control instructions carried out in parallel.

Adding these features to the set of resources available in current FPGAs, we find that the well-established software computing paradigms have to be reevaluated. The techniques and methods considered optimal for software may not be optimal for FPGAs, or may require changes or even improvements to take advantage of FPGA features. This way, the full exploitation of FPGAs will not only require the adaptation of methods and algorithms but also the development of new techniques specially suited for FPGAs.

1.1.3.4. Design Tools

Hand-written RTL development and debugging are too time-consuming and error prone. The limited availability of good high-level design tools is one of the major shortcomings that a designer has to face when developing a hardware accelerator. Their use can substantially shorten the development cycle, abstracting the designer from low level details, helping in the debugging of the design, and automatically carrying out tedious tasks. Additionally, these tools should provide useful features such as:

• Quick design space exploration.
• Verification aids (test bench generation, software-hardware comparison).
• Support for different arithmetics.
• Automated hardware-software integration.

Currently, the availability of this kind of tools is very limited. The focus is on translating C code to synthesizable RTL code [Tec, Gra]. The possible loss of acceleration associated with the use of these tools is a price worth paying in order to shorten the design time. It is not an objective of this Ph.D. Thesis to develop this kind of tools, but particular attention will be devoted to putting forward the special needs that the development and design of FPGA-based accelerators require.

1.1.3.5. Hardware-Software co-design and integration

When developing a hardware accelerator we are dealing with all the major issues related to hardware-software co-design. First, it has to be decided which parts of the code are going to be executed in the accelerator and which ones remain in software. Second, the communication mechanism between software and hardware has to be defined and implemented. Third, software and hardware have to be integrated. These tasks are complex and require expert designers because they have a strong impact on the performance of the accelerated application.

With respect to the original code, the designer has to evaluate not only which parts of the code are most suited for the accelerator and which ones should remain in software, but also the impact of the data transfers associated with the tasks carried out in the hardware. These data transfers may become a bottleneck in the system, degrading the global performance.

An efficient communication mechanism is required for the synchronization of software and hardware and for the data transfers. From the software point of view, this implies a computational overhead that must be as small as possible. From the hardware point of view, if the communication mechanism is not efficient it could imply that the hardware does not work at its maximum performance, as the hardware accelerator has to wait to transfer data.

Finally, the software and the hardware must be integrated, involving complex and tedious tasks (the modification of the software to invoke the hardware replacing the accelerated code, a driver, a low-level library to control the driver from the software and the hardware itself, etc).
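One common way to keep both sides busy and hide the transfer overhead is to overlap CPU work with accelerator work through double buffering. The sketch below is only an illustration of that pattern under assumed names (prepare_batch, accelerator_run and post_process are hypothetical stand-ins); the actual infrastructure developed in this Thesis, based on two RAM memory zones and a bus-master PCI-Express core, is described in Chapter 6.

    #include <future>
    #include <vector>

    // Hypothetical stand-ins for the CPU-side work and the accelerator call.
    std::vector<double> prepare_batch(int i) { return std::vector<double>(1024, i); }
    std::vector<double> accelerator_run(std::vector<double> in) { return in; }
    void post_process(const std::vector<double>& out) { (void)out; }

    int main() {
        const int n_batches = 8;
        std::vector<double> next = prepare_batch(0);
        for (int i = 0; i < n_batches; ++i) {
            // Hand the current batch to the accelerator (buffer A)...
            auto hw = std::async(std::launch::async, accelerator_run, std::move(next));
            // ...and let the CPU fill buffer B for the next batch in the meantime.
            if (i + 1 < n_batches) next = prepare_batch(i + 1);
            post_process(hw.get());   // synchronize and consume the results
        }
        return 0;
    }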

1.2. Objectives and Thesis Structure

Studying in depth each one of these five challenges related to hardware acceleration is not feasible in just one Thesis. The great variety of applications that can be accelerated and the different features among them imply that the complexity of each task is high. Therefore, in this Thesis we have chosen one subset of applications to be studied, dealing with the implementation of a real application of this subset. Selecting a complex subset of applications, in our case Monte Carlo simulations, allows us to make a general analysis of the main topic, hardware acceleration, from the study, analysis and design of a particular application, since this subset shares several features with many other applications. Specifically, we have selected a financial application, the Monte Carlo based LIBOR Market Model.

Financial simulation is a remarkable research field for FPGA acceleration of Monte Carlo simulations, where obtaining quick and accurate results is essential. However, most financial models are computationally intensive and acceleration is necessary to obtain their results with the required speed. Additionally, there is a great variety of financial models, allowing the selection of a complete model where different hardware acceleration issues can be studied.

The focus of this work is on identifying and providing the key elements of a hardware accelerator and incorporating them into the target application, characterized by high complexity and hard timing requirements. Furthermore, the integration of the hardware and software parts will also be addressed in depth. In this section we will start by presenting the Monte Carlo basics and the target application to identify the elements that play a key role in its hardware acceleration. Next, these elements will be used as a base to formulate the main objectives of this Thesis.

1.2.1. Monte Carlo Simulations and Target Application: LIBOR Market Model

Monte Carlo simulation is often the only tool for dealing with otherwise intractable problems in the areas of scientific calculation or stochastic problems. A particular case of these intractable problems is financial simulation (the simulation of the pricing of financial derivatives), characterized by its extremely complex models and hard execution time constraints. Both requirements have been identified as especially challenging in the case of FPGA-based acceleration, and this is therefore the application case used as a benchmark in this Thesis.

1.2.1.1. Monte Carlo Basis

Monte Carlo simulations rely on the use of random numbers to evaluate mathematical expressions. In these simulations, the main state variables of the system under study are sampled using random number generation and the evaluation of a mathematical expression is repeated several times with different random numbers. Finally, the results are generally obtained as measures of the probability distribution of the system properties. The accuracy of these methods is ensured by the Law of Large Numbers, which guarantees the convergence of the method as the number of simulations grows to infinity. For a finite number of replications an error is introduced due to the existence of variability in the final result.

Monte Carlo methods were first developed [MU49] by physicists at Los Alamos Laboratory in the context of the Manhattan project. Now, they are widely used to solve three types of problems: mathematical problems whose analytical expressions are very complex (such as certain multidimensional integrals), modeling of physical phenomena where there is uncertainty, and finally, systems with a large number of coupled degrees of freedom. Among the last two types of problems, special attention has been paid to the simulation of physical systems such as molecular dynamics [NBK+09], quantum systems [WAG09] or ray tracing, and to the case on which we have focused, the simulation of financial systems [Gla04], where Monte Carlo is used to value portfolios, interest rate options and other financial products, or insurance simulations [PHT+07].
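As a minimal generic illustration of these ideas (an example assumed for clarity, not taken from the Thesis), the sketch below estimates E[exp(Z)] for a standard normal variable Z, whose exact value is exp(0.5). The statistical error of the estimate decreases only as sigma/sqrt(N), which is precisely why so many replications, and hence so much computation, are needed.

    #include <cmath>
    #include <cstdio>
    #include <random>

    // Generic Monte Carlo sketch (not from the Thesis): estimate E[exp(Z)], Z ~ N(0,1).
    // The exact value is exp(0.5); the statistical error shrinks as sigma / sqrt(N).
    int main() {
        std::mt19937 urng(42);                              // uniform pseudo-random source
        std::normal_distribution<double> gauss(0.0, 1.0);   // N(0,1) samples
        const long N = 1000000;                             // number of replications
        double sum = 0.0, sum2 = 0.0;
        for (long i = 0; i < N; ++i) {
            double f = std::exp(gauss(urng));               // evaluate the expression once
            sum += f;
            sum2 += f * f;
        }
        double mean = sum / N;                              // Monte Carlo estimate
        double var = sum2 / N - mean * mean;                // sample variance
        std::printf("estimate %.6f (exact %.6f), std. error %.2e\n",
                    mean, std::exp(0.5), std::sqrt(var / N));
        return 0;
    }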

1.2.1.2. Specific Design Challenges

Even though Monte Carlo methods are used to solve very different problems, all these simulations share three features that turn them into complex applications that are also computationally very intensive:

• Use of random samples.
• Use of complex mathematical expressions.
• A large number of replications.

These features make them ideal candidates for hardware acceleration. Furthermore, the complexity may increase even further due to:

• Complex distributions for the random samples (Gaussian or Log-normal).
• Variance reduction techniques.
• Complex floating-point operations.
• Datapaths requiring control with added complexity (data dependencies).

1.2.1.3. LIBOR Market Model

The LIBOR Market Model (LMM) [BGM97] is a model for pricing interest rate derivatives. This model involves several variables that have to be calculated before obtaining its main variables, the LIBORs. The calculation of these variables can be done with different models and complexities, requiring even exponentiation operators and vector products. Therefore, it provides a perfect scenario to explore different solutions for this type of complex operators and structures. In Chapter 5 the LMM will be explained in detail. However, we can advance the main features of the model from the perspective of an FPGA implementation:

• High quality Gaussian random variables are required.
• The use of variance reduction techniques is mandatory to reduce the simulation time. In particular, we will study Latin Hypercube.
• High accuracy requirements: a huge number of replications and floating-point arithmetic.
• Complex floating-point operations, such as the exponentiation function, are required.
• There are computationally intensive routines, such as the correlation of variables.
• Presence of complicated data dependencies.
• Complex control due to the synchronization of different simulation scenarios.

In this way, the LIBOR Market Model is a demanding benchmark, ideal for our purpose of studying the identified design challenges related to FPGA hardware acceleration and, in addition, a good application to investigate whether FPGA acceleration is a feasible solution for financial simulation.

1.2.2. Objectives

As just mentioned, Monte Carlo simulations are especially well suited for being accelerated using reconfigurable hardware. Nevertheless, developing an FPGA Monte Carlo simulation from scratch is almost impossible, or at least very time consuming. However, if the designer can use predesigned libraries of the most common elements, the development time can be substantially reduced, and this is the first focus of this Ph.D. Thesis.

Following this scheme, one of the main objectives is to study the common elements that play a key role in Monte Carlo simulations and in our target application. Two common elements stand out: first, the random number generators that are required for the underlying random variables, and second, the floating-point operators that are the base elements for implementing the mathematical models that are evaluated.

In this way, the first objective of this Ph.D. Thesis is the study, design and implementation of random number generators. In particular, we focus on Gaussian random number generation and the implementation of a complete generator compatible with variance reduction techniques that can be used for our target application and for other applications.

Meanwhile, the second objective deals with the implementation of efficient and FPGA-oriented mathematical operators (complex and using floating-point arithmetic). We focus on the design, development and characterization of libraries of components. Instead of focusing on the algorithms of the operators, our approach is to study how the format can be simplified to obtain operators that are better suited for FPGAs and present better performance. One important goal here is to achieve libraries of general purpose components that can be reused in several applications and not just in a particular target application.

Related to the target application, a third objective of this work is to study in depth the implementation of a particular operator, the exponentiation function. This operator is required in many scientific and financial simulations. Its complexity and the lack of previous general purpose implementations deserve special attention.

The next objective is related to the global purpose of the Thesis: validating all the previously developed elements through the implementation of a complex Monte Carlo simulation that involves all the features that can be found in Monte Carlo simulations. In this way, we deal with the implementation of the target application, the LIBOR Market Model. Special attention is devoted to all the features, requirements and circumstances that affect the performance of the accelerator.

To validate all the research done, the next objective is to obtain the experimental results provided by the developed FPGA accelerator, which are validated against the original software implementation. Three main features will be analyzed:

• Correctness of the results obtained.

• Accuracy.

• Speedup factors obtained by the global application and by each of the main components.

Finally, the last objective is the integration of the hardware accelerator within the original software application. All issues related to the communication mechanism are studied, putting special focus on how performance is affected by data transfers and by the hardware-software partitioning policy implemented.

Figure 1.1: Thesis Structure.

1.2.3. PhD Thesis Structure

As mentioned before, the scheme followed in this Thesis is a bottom-up methodology with respect to the target application, starting with the components and cores needed, following with the implementation of an accelerator, and finishing with its integration with the original software application.

The structure of this Thesis follows this methodology, and we have organized this document in five main chapters, each one corresponding to one of the key research objectives previously identified:

2. Random number generation.
3. Libraries of floating-point operators.
4. Exponentiation function.
5. Development of the target application (LIBOR Market Model).
6. Hardware-software integration.

Figure 1.1 shows how these objectives interact and in which chapters they are discussed. The first three objectives tackle the study, design and implementation of the common elements that are required in Monte Carlo simulations, random number generators and mathematical operators (Chapters 2, 3 and 4). Three main features are sought: high performance cores, reusability and specific design to take advantage of the FPGA set of resources.

In particular, Chapter 2 focuses on Gaussian random number generators (they also comprise uniform random generators) as they are widely used in Monte Carlo simulations. Three main topics are studied with respect to FPGA acceleration: uniform random number generators, Gaussian random number generators and variance reduction techniques. Finally, Chapter 2 ends with the implementation of a parameterizable Gaussian random number generator compatible with variance reduction techniques that will be the base for the random generation of the target application.

Regarding mathematical operators, Chapter 3 concentrates on floating-point arithmetic, as this is the arithmetic required for many scientific and financial Monte Carlo simulations. The floating-point standard and format are studied and several simplifications are proposed. Additionally, the capabilities of FPGAs to implement data chains with a high number of operators are studied. Meanwhile, in Chapter 4 we analyze in further detail a complex operator, the exponentiation function, focusing on how we can take advantage of FPGA flexibility to ensure an accurate result.

Once we have studied the key elements for any Monte Carlo simulation, random generators and mathematical operators, the next issue we face is the implementation of an accelerator corresponding to a real application and the challenges this implies (Chapter 5). In this chapter, we study how a complex model can be implemented in a hardware accelerator and all the limitations and restrictions that we have to face. Finally, the accelerator has been integrated within the original application in a personal computer system (Chapter 6). In this chapter, the software-hardware partition policy that we have followed for the implementation of the LMM core can be found.

Each chapter is almost a complete study of the topic under analysis, comprising theory, study, design, implementation and results. In the same way, each chapter also includes its own review of related works.

2 Random Number Generation

As in other stochastic simulations, a key element for any Monte Carlo simulation is a good random number generator to sample the variables under study. Moreover, good quality random numbers are a must, since the quality of the results obtained is directly related to the quality of the random numbers used.

Depending on the nature of the model simulated using the Monte Carlo method, the required random numbers will follow specific probability distribution functions. However, almost all random number generation methods rely on the use of a base uniform random number generator (URNG) whose samples are transformed into the target distribution following some method or equation. Therefore, using a good URNG is a key issue for any random number generator (RNG).

When random numbers are related to the Monte Carlo method, besides the quality of the numbers and the specific distribution needed, one more fact has to be taken into account: the compatibility with variance reduction techniques. The huge number of replications of the model needed in Monte Carlo simulations impacts directly on the total simulation time required. These techniques have been developed to reduce the number of replications and hence the total simulation time, so random number generation has to be studied taking these techniques into account.

In this chapter, random number generation is studied from the global perspective of Monte Carlo simulation and considering FPGA implementation issues. First, an overall introduction to RNGs is provided, followed by an analysis of URNGs. Afterwards, Gaussian RNGs are studied, as the Gaussian distribution is one of the most common distributions for Monte Carlo simulations and is the one required by our target application, the LMM. Then, variance reduction techniques are introduced. Hardware-related issues are discussed for all these topics while some generators for FPGAs are developed.

The main objective of this chapter is focused on this last point: the study and implementation of FPGA generators oriented to hardware acceleration. In this way, special attention is devoted to the implementation of a parameterizable Gaussian RNG compatible with variance reduction techniques and designed specifically to be used in accelerators. This element is not only a key component for our selected application and many other Monte Carlo simulations; its quality is also decisive for the accuracy of the simulation results. Hence, fulfilling high quality requirements is another important issue in this chapter. As exposed in Section 2.3.2, in the literature we cannot find any FPGA Gaussian RNG that fulfils this criterion in combination with all the other criteria that we have identified for a Gaussian RNG. Therefore, a new Gaussian generator is developed, focusing on three main components:

• The gaussian generation method selected, the inversion method, and how we adapt it to FPGAs.

• A high-performance uniform RNG, a Mersenne Twister one, to be used as base generator for the inversion-based gaussian RNG.

• A parameterizable variance reduction techniques core to be used in combination with the uni- form RNG.

2.1. Random Number Generation: Overall Introduction

An ideal RNG fulfills two main characteristics. Firstly, it generates random numbers whose distribution follows exactly the target distribution. Secondly, the random numbers should be uncorrelated, in other words, mutually independent.

However, except for some rare generators based on physical events like radioactivity or the electron's spin, and thereby not feasible to be used here, both hardware and software conventional random number generators cannot completely fulfil the second characteristic. Generators rely on equations which use previously generated random numbers or on events that are not completely random. Consequently, it is more adequate to talk about pseudo-random generators, and taking this into account, the main feature that a good generator should accomplish is that, for the given quantity of random numbers needed, the generator behavior resembles the behavior of an ideal generator.

Obviously, this is not a measurable characteristic, but an idea of how good a random number generator is can be obtained by looking at:

1. Periodicity: algorithm-based generators have a period. The sequence of generated numbers starts repeating after a period since algorithms are deterministic and have a finite number of states.

2. Randomness: the generated sequence must behave as a truly random sequence. Randomness is not a quantifiable parameter but there are two methods to evaluate it:

• Theoretical properties of the algorithm.
• Statistical tests.

Additionally, other important features to consider for a random number generator are:

1. Reproducibility: the capacity of generating the same sequence again. In algorithm-based generators, if the same seed is used, the sequence will always be the same.

2. Speed: number generation throughput. Generation speed is very important, as many applications demand a huge number of random values for their simulations.

3. Portability: the same generator should produce the same sequences of numbers on different computing platforms (either Hardware-based or Software-based platforms).

2.2. Uniform Random Number Generation

Uniform Random Number Generators (URNG), which provide a uniform distribution, in particular over the interval [0,1), are the most common generators, as generators following other distributions almost always need a URNG as a base generator. Computer URNGs are mainly based on algorithms relying on integer operations that generate a uniform distribution of integers in the interval [0,m) and then scale them to [0,1) by dividing by m. More complex generators follow the same scheme, except that they combine the results from several basic generators before the scaling in order to improve the theoretical properties of the basic algorithms.

Developing good URNGs is quite easy (both in hardware and software) and has been studied in depth [L'E97]. Multiple URNGs can be found in the literature based on different methods: linear congruential [Leh51], Mersenne Twister [MN98] or Tausworthe [Tau65] generators, to name a few. Most of them, although only involving bitwise operations or simple equations, present good quality (high period and good randomness), so the random samples are efficiently generated without increasing the complexity of the Monte Carlo simulation.

In the research field of specific URNGs for FPGAs, several works by D. B. Thomas et al. stand out, such as [TL07, TL10], where several URNGs are proposed, studied and developed exploiting the specific resources and architecture of FPGAs. The development of new URNGs has been widely studied in the last years and is out of the scope of this Thesis. However, a brief explanation of the most common methods is presented next to explain why we have selected the Mersenne Twister generator and its advantages with respect to other generators.

2.2.1. Linear Congruential Generators (LCG)

LCGs are recurrences of the following form:

x_{i+1} = (a x_i + b) mod m    (2.1)
u_{i+1} = x_{i+1} / m    (2.2)

where a, b and m are positive integer constants. In this method, as in all others that use a recurrence, an initial value known as the seed, x_0 (between 0 and m−1), is necessary to start generating values. As a recurrence-based method, the sequence of random numbers is generated from the seed, in this case following the previous equations. The use of a seed and an equation gives these generators two common features:

• For the same seed, the generated sequence is always the same, as the algorithm is deterministic.

• As a generated value only depends on the previously generated number, when the seed is repeated the sequence of numbers starts repeating from that value. In this way, these generators are periodic (m is not infinite, and it is the maximum number of different values) and this period is independent of the seed used.

There are two types of LCG which differ in the value of the constant b. When b is equal to 0 the generator is pure, while if b is not 0 it is a mixed LCG. For each type there are some conditions to ensure that the LCG has a full period of m numbers (all numbers between 0 and m − 1 are generated before any value is repeated).

2.2.1.1. Pure LCG

x_{i+1} = (a x_i) mod m,    u_{i+1} = x_{i+1} / m    (2.3)

As b is 0, pure LCGs generate uniform values in the interval (0,1). Zero is not reached because, once it is obtained, all subsequent values would also be zero. Hence the full period has m − 1 values. Pure LCGs are also known as multiplicative LCGs because the conditions between a and m to ensure the full period are multiplicative conditions:

• m is a prime number.

• a is a primitive root of m:

– a^{m−1} − 1 is a multiple of m.
– a^{j} − 1 is not a multiple of m for j = 1, 2, ..., m−2.

• x_0 ≠ 0 (if the seed is 0, all generated numbers are 0).
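For illustration only, the following C sketch (not taken from this Thesis) implements a pure LCG satisfying these conditions, using the well-known Park-Miller "minimal standard" parameters m = 2^{31} − 1 (prime) and a = 16807 (a primitive root of m):

    #include <stdint.h>

    /* Pure (multiplicative) LCG: x_{i+1} = (a * x_i) mod m, u = x / m.
     * Park-Miller parameters give full period m - 1.                     */
    static uint64_t lcg_state = 1;           /* seed, must be non-zero    */

    double lcg_next(void)
    {
        const uint64_t a = 16807u;
        const uint64_t m = 2147483647u;      /* 2^31 - 1                  */
        lcg_state = (a * lcg_state) % m;     /* 64-bit product avoids the
                                                overflow problem discussed
                                                below                     */
        return (double)lcg_state / (double)m;    /* uniform in (0,1)      */
    }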

2.2.1.2. Mixed LCG (MLCG)

x_{i+1} = (a x_i + b) mod m,    u_{i+1} = x_{i+1} / m    (2.4)

Now b is not zero, so the full period includes zero as a valid value [0,1). The conditions to ensure the full period are:

• b and m are relatively prime (their only common divisor is 1, i.e., gcd(b,m) = 1).

• Every prime number that divides m divides a − 1.

• a − 1 is divisible by 4 if m is divisible by 4.

2.2.1.3. Problems related to LCG

LCGs have two main problems. The first one relates to the randomness properties of the method. Each value of the sequence is obtained directly from the previous one, so the randomness achieved is not very good, as the correlation between numbers can be high. The second problem is computational: to achieve very high periods, very high values of m are necessary, and this can create overflow problems in the multiplication.

To solve both problems, it is common to employ more complex generators, combined generators, based on the LCG recurrence. There are two types of combined generators. One type uses several previous values to generate the next one (instead of using only the previous one). The other type generates the random variable by combining several LCGs (simple or combined ones).

2.2.2. Combined Generator Rand2

One of the combined generators with the best quality and most extensively used is the one known as rand2 (as it is referred to in [PTVF88]). This generator is based on the combination of several multiplicative LCGs [L'E88] according to the following equation:

x_i = (∑_{j=1}^{l} (−1)^{j−1} s_{j,i}) mod (m_1 − 1)    (2.5)

where l generators are combined, with a period

ρ ≤ (∏_{j=1}^{l} (m_j − 1)) / 2^{l−1}    (2.6)

In the specific case of rand2, two multiplicative LCGs are combined, while the algorithm is additionally improved with a mixing technique to scramble the sequence of the generator: the Bays-Durham shuffle [BD76]. This technique consists in storing a group of calculated random variables in a table and randomly reading one of them. In this way, in each iteration a random variable is read from a position of the table while the random variable calculated in that iteration is stored in the same position. Thus, the scrambling affects both the sequence of output numbers and the generation of the next value, as it depends on the value read in the previous iteration.
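A simplified sketch of the Bays-Durham shuffle is shown below. It is wrapped around a generic base generator next_base(), which is a placeholder and not a routine from this Thesis; rand2 itself additionally combines two multiplicative LCGs as described above.

    #define SHUFFLE_SIZE 32

    /* Bays-Durham shuffle: a table of previously generated values is read
     * at a position chosen by the previous output; the value read becomes
     * the output and the freshly generated value takes its place.         */
    static unsigned long table[SHUFFLE_SIZE];
    static unsigned long last_output;        /* drives the table index     */

    extern unsigned long next_base(void);    /* placeholder base generator */

    void shuffle_init(void)
    {
        for (int i = 0; i < SHUFFLE_SIZE; i++)
            table[i] = next_base();          /* fill the table             */
        last_output = next_base();
    }

    unsigned long shuffled_next(void)
    {
        int j = last_output % SHUFFLE_SIZE;  /* position from previous read */
        last_output = table[j];              /* value read is the output    */
        table[j] = next_base();              /* new value stored in its place */
        return last_output;
    }

Note how the index of the next read depends on the value read in the previous iteration, which is the property highlighted in the text.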

2.2.3. Tausworthe Generators

2.2.3.1. Tausworthe-LFSR (basic)

The basic Tausworthe generator is an LCG which combines several values of the recurrence to generate the next value, following the equation:

b_i = (a_1 b_{i−1} + a_2 b_{i−2} + ... + a_k b_{i−k}) mod 2    (2.7)

where the a_i coefficients are equal to 0 or 1 and the b_i values are also binary values [Tau65]. As the modulus is a prime number, the linear recurrence is determined by the polynomial:

P(z) = z^k − (a_1 z^{k−1} + ... + a_k)    (2.8)

and, if P(z) is a primitive polynomial, the linear recurrence will have the full period ρ = 2^k − 1. In this way, uniform numbers can be obtained with the equation:

u_n = ∑_{i=1}^{L} b_{n+i−1} 2^{−i}    (2.9)

where L is the precision of the number that we want to generate. The problem with this type of implementation is that the uniform variable is generated from only one step of one linear recurrence, and thereby all its bits keep some correlation between them.

2.2.3.2. Combined Tausworthe

To solve the above mentioned problem, several options can be considered. First, each bit for obtaining u_n can come from a different, independent linear recurrence:

u_n = ∑_{i=1}^{l} b_{n,i} 2^{−i}    (2.10)

for l independent linear recurrences.

A second option is to obtain u_n from only one linear recurrence, but from s steps of that recurrence:

u_n = ∑_{i=1}^{L} b_{ns+i−1} 2^{−i}

In this case, if ρ = 2^k − 1 and s are coprime, then u_n will have a full period equal to ρ. To generate u_n from u_{n−1}, s steps of the linear recurrence are needed. But if some conditions are fulfilled [L'E96]:

• P(z) is a primitive trinomial with the form:

P(z) = z^k − z^q − 1  =>  b_i = (b_{i−k+q} + b_{i−k}) mod 2

• 0 < 2q < k.

• 0 < s ≤ k − q < k ≤ L.

then these s steps can be obtained very quickly, see Section 2.5.1.1. Finally, several generators following this scheme can be additionally combined into just one generator by performing a bitwise XOR of the u_n of each generator [L'E96]. If their polynomials are coprime, the period of the combined generator will be the least common multiple of the periods of the individual generators.

2.2.4. Mersenne Twister

The Mersenne Twister [MN98] (MT) URNG presents very high quality while achieving a huge period, 2^{19937} − 1 in its most used configuration. Nowadays, this URNG is widely used for Monte Carlo simulations due to its high quality and high performance. In this generator, groups of w bits are handled as vectors and a linear recurrence is applied to those vectors instead of to single bits:

x_{k+n} = x_{k+m} ⊕ (x_k^u | x_{k+1}^l) A    (2.11)

where the x vectors are formed of w bits. Finally, to improve the statistical properties of the generator, the output random numbers of the generator are not the vectors of the linear recurrence. Instead, the numbers generated in the recurrence are modified, tempered, with a bitwise multiplication by a w × w binary matrix. In Section 2.5.1.2, a more in-depth analysis of this generator can be found.

2.3. N(0,1) Gaussian Random Number Generation

Most computationally intensive Monte Carlo simulations require a high quality Gaussian Random Number Generator (GRNG), in particular with Normal distribution N(0,1). Furthermore, the software complexity of this type of generators makes them ideal candidates for hardware acceleration. Therefore, developing a high-quality, high-performance hardware GRNG is essential for any Monte Carlo hardware accelerator.

2.3.1. Generation methods

Implementing an N(0,1) Gaussian random number generator can be done using several methods, such as acceptance-rejection, Wallace [Wal96], Box-Muller [BM58] or inversion [BFS83]. Furthermore, all of them are suitable for FPGA implementation [ZLL+05, LLZ+05, LLVC06, LCVL06] (a comparison of results between them can be found in [LCVL06]). Next, the main features of these methods are briefly introduced.

2.3.1.1. Acceptance-Rejection Methods

These methods are based on the generation of another distribution, similar to the target one, which can be easily generated. The samples from this base distribution are candidates for generating the target distribution and are accepted or rejected (subsampled) according to a mechanism designed to select candidates of the target distribution.

The main features of these methods are determined by the inherent nature of the method. On the one hand, the complexity of generating some distributions is reduced, as a more easily generated base distribution is used for obtaining the candidates. On the other hand, the samples of the base distribution are rejected with a probability directly dependent on the difference of density between both distributions and, therefore, a constant throughput of generated samples cannot be ensured.

2.3.1.2. Wallace Method

The Wallace method is based on the idea of obtaining Gaussian variables in the same way as uniform variables are obtained from previous uniform variables, following some recurrence. In particular, this method transforms a vector of K Gaussian variables into K new ones using an orthogonal transformation with a matrix. One of the main features of the Wallace method is that it requires an initial pool of normalized Gaussian variables whose average square value is one. This feature implies the need of a scaling factor to correct the value of the generated variables. The other main feature is that the correlation between variables has to be carefully handled, as new Gaussian variables are generated from previous Gaussian variables.

2.3.1.3. Box-Muller Method

The Box-Muller method employs a straightforward transformation to convert two uniform variables, u_1 and u_2, into two Gaussian ones, x_1 and x_2, following the equations:

x_1 = √(−2 ln u_1) cos(2π u_2),    x_2 = √(−2 ln u_1) sin(2π u_2)    (2.12)

These equations determine the main feature of this method, the use of complex operators to obtain the Gaussian variables, and its main difference with respect to other methods: in each iteration two Gaussian variables are obtained instead of one.
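For reference, a direct C transcription of Equation 2.12 is sketched below; uniform01() is a hypothetical U(0,1) source assumed never to return exactly zero, so that the logarithm is well defined.

    #include <math.h>

    extern double uniform01(void);   /* hypothetical U(0,1) source, u1 > 0 */

    /* Box-Muller transform (Eq. 2.12): two uniforms -> two N(0,1) samples. */
    void box_muller(double *x1, double *x2)
    {
        const double two_pi = 6.283185307179586;
        double u1 = uniform01();
        double u2 = uniform01();
        double r  = sqrt(-2.0 * log(u1));    /* common radius term          */
        *x1 = r * cos(two_pi * u2);
        *x2 = r * sin(two_pi * u2);
    }

The square root, logarithm and trigonometric functions are the "complex operators" referred to above.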

2.3.1.4. Inversion Method

The inversion method is a general method to generate any probability distribution using the inverse function of its corresponding cumulative distribution function (CDF) and uniform variables. The uniform variables correspond to values of the cumulative probability, so the variables of the desired distribution are obtained by calculating, with the inverse function (ICDF), the values that generate those probabilities. This conversion will be addressed in more detail in Section 2.3.3.

Therefore, this method implies a direct transformation of uniform variables into the desired variables, and its main features cannot be generalized, as they will depend on the target distribution and on how the inverse function is implemented. However, one very important feature stands out: as the basis of this method is a one-to-one transformation using the ICDF, the obtained sequence of variables with the desired distribution will keep the structural properties of the uniform sequence.

2.3.2. Monte Carlo Implications and Hardware Implementation

Due to the stochastic nature of the Monte Carlo simulation and according to the Law of Large Numbers, the convergence of a Monte Carlo simulation is ensured when the number of replications of the system under study grows to infinity. For a finite number of paths an error can be introduced due to the variability of the final result. This variability strongly depends on the variance of the underlying variables. For example, if the arithmetic mean of a function f of the system is to be obtained, its expression, for m simulations, would be:

µ = E[f] ≈ (1/m) ∑_{i=1}^{m} f_i    (2.13)

where f_i is the evaluation of f for each replication. According to the Central Limit Theorem, the standard deviation of the approximated value of µ would be:

σ = Std[µ] = Std[f] / √m    (2.14)

The standard deviation implies a confidence interval for the approximated value obtained: the smaller the interval, the more accurate the value. The confidence interval can then be reduced either by increasing the number of replications m or by reducing the variance of f.

Increasing the number of replications m has a very important drawback. The statistical nature of these methods makes this type of simulations computationally very intensive. The calculation time grows linearly with the number of replications, but the standard deviation decreases only as the square root of it. Therefore, reducing the result deviation by increasing m has an important impact on the execution time. Moreover, this increase has very little impact on the standard error when m is already big, because 1/√m decreases very smoothly (Figure 2.1).

The second alternative, reducing the variance of f, relies on a set of methods known as variance reduction techniques [Gen98, Gla04]. These techniques are based on smart sampling of the space of the underlying random variables so that the variance of f is reduced, thus reducing the number of replications required and the execution time needed, while having a limited impact on the calculation time. In Section 2.4 variance reduction techniques will be presented in depth. As the main problem we are dealing with is calculation time, the compatibility of the Gaussian generation method with these techniques becomes a must.
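As a concrete illustration of this trade-off, which follows directly from Equation 2.14:

σ(m) = Std[f] / √m    ⟹    σ(4m) / σ(m) = √(m / 4m) = 1/2,

i.e., halving the confidence interval by brute force requires four times as many replications and, since the calculation time grows linearly with m, roughly four times the execution time, whereas a variance reduction technique that halves Std[f] achieves the same effect at a much smaller cost.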


Figure 2.1: Plot of 1/√m.

Also related to the calculation time, two more issues have to be handled. To fully exploit the FPGA's capacity for large pipelined datapaths, one important feature for a hardware GRNG is that it must generate one random number per clock cycle, so the datapath is not halted waiting for new random numbers. Furthermore, the frequency at which the random numbers are generated should be high, so the performance of the whole datapath is not limited by the GRNG. In a hardware implementation, this is closely related to whether a generation method can be pipelined or not.

Finally, one more issue has to be studied: which arithmetic is more suited to generate the random numbers. Gaussian distribution variables are concentrated around zero, reaching values like 5.8 × 10^{−8} (32-bit integer uniform input with the pre-scaled value of 2^{31} + 20), while in the tails the Gaussian distribution reaches up to 6.23. Therefore, the arithmetic employed must provide enough resolution to represent those values with high accuracy and precision, considering that the most significant bit can be displaced by more than 20 bits. Hence, we have considered that the generation method should be compatible with floating-point arithmetic, as it adapts better to this range of values than fixed-point arithmetic (which would require a high precision to guarantee high quality samples).

In summary, four implications or criteria have to be considered when selecting the most suited method for Gaussian random generation:

1. Compatibility with variance reduction techniques.
2. Possibility of generating one sample per cycle.
3. High clock rate, or possibility of pipelining to achieve it.
4. Compatibility with floating-point arithmetic.

Due to the combination of these four criteria, we have selected the inversion method as the generation method most suited for Monte Carlo based simulations, as it is the only method which can fulfill the four criteria, see Table 2.1. Additionally, the inversion method has another advantage: it is a general method suitable for other target distributions, so the framework developed for an inversion GRNG can be reused for RNGs of other distributions.

Table 2.1: Gaussian Generation Methods - Selection Criteria.

                     Acc-Rej   Wallace   Box-Muller   Inversion
Variance Reduction   Hard      Hard      Hard         Easy
Sample/Cycle         No        Yes       Yes          Yes
Pipeline             Yes       No        Yes          Yes
Floating-point       Yes       No        Yes          Yes

With respect to the first criterion, the inversion method is the only one where variance reduction techniques can be applied on the base uniform variables, while in the other three methods they should be applied over the Gaussian variables, see Sections 2.3.3 and 2.4. Regarding the second criterion, the acceptance-rejection method cannot ensure one sample per cycle, as some samples are rejected. Furthermore, the application of variance reduction techniques over the Gaussian variables instead of over the base uniform ones can have an impact on this criterion.

The third criterion is fulfilled by all the methods except the Wallace one. Due to the nature of the recursion it uses, if pipelining is introduced, it cannot achieve one sample per cycle. In the other three methods, all the involved mathematical operations can be pipelined, their base generators (uniform for inversion and Box-Muller, or the one selected in acceptance-rejection) being the only part that cannot be pipelined in case they depend on a recursion based on the last generated value. As these base generators are usually very fast, they do not compromise the global clock frequency. Finally, again the Wallace method is the only one that does not fulfill the last criterion: its recursion is between fixed-point Gaussian samples whose conversion to floating-point samples implies a degradation of the distribution quality.

Although the inversion method can fulfil all the criteria, this is not the case for the inversion-based Gaussian RNG found in the literature, [LCVL06]. That particular implementation is a fixed-point implementation whose samples cannot be converted to high quality floating-point samples due to its limited quality and resolution, see more details in Section 2.5.2.3. Therefore, we have developed a new inversion-based generator. Next, the theoretical basis of the approach we have followed is studied.

2.3.3. Inversion Method with Quintic Hermite Interpolation

As stated before, inversion is a general method to generate any probability distribution using the inverse function of the cumulative distribution function (CDF) of the desired distribution: 2.3 N(0,1) Gaussian Random Number Generation 27

Figure 2.2: Inversion Method.

CDF(X ≤ x) = F(x) = ∫_{−∞}^{x} f(t) dt    (2.15)
X = F^{−1}(U)    (2.16)

where U is a uniform distribution from which the desired distribution, f(x), is obtained, F(x) being the cumulative distribution of f(x). In Figure 2.2 the method is depicted for N(0,1). The generated uniform variables correspond to values of the cumulative probability (the y axis in the GCDF graphic of the figure). Then the distribution variables are obtained by calculating, with the inverse function, the x values that generate those probabilities.

Although this method relies on this simple idea, it has a great advantage: this kind of transformation of uniform samples into the target distribution samples keeps the structural properties of the uniform sequence, and thereby variance reduction techniques can be directly applied to the uniform variables. With this method, any distribution can be generated if its F^{−1} is known. However, F^{−1} can be a very complex function, as is the case for the Normal Gaussian distribution N(0,1), so it has to be approximated and numerical algorithms are needed for its computation.

Table 2.2: Gaussian ICDF Implementation Requirements.

                            +/−    ×    /    √    exp    ln
erfc^{−1} [Mor83]            37   34    3    2     1      2
Interpolation (degree n)      n    n    n    −     −      −

2.3.3.1. Direct Inversion

The inverse function approximation can be done in several ways, and the most direct one is to perform a global approximation, where F^{−1} is approximated by just one function valid for any value in the range (0,1). This method is the most intuitive one, and for any uniform value in (0,1), the target distribution variable is obtained just by computing F^{−1}. The problem is how to obtain F^{−1}; in our case, the inverse function of the CDF of N(0,1), where the CDF is:

CDF(x) = (1/2)(1 + erf(x/√2)) = (1/2) erfc(−x/√2)    (2.17)

where erf and erfc are the error function and the complementary error function. To obtain the inverse function of the CDF it is necessary to obtain the inverse of the erf or the erfc functions. However, both functions are very complex to calculate. In particular, and as can be seen in Table 2.2, the inverse erfc function calculated with the method developed in [Mor83] requires a huge number of operations, among which we can find even exponential and logarithm functions.

2.3.3.2. Interpolated-Segments Inversion

Another way to compute the inverse function is to approximate it by segments or intervals. In this case, the range of the inverse function is divided into several segments, and then the inverse function is approximated in each segment by interpolation following a polynomial equation. Unlike the previous method, interpolation inversion is not a direct method. Before the inversion is calculated for any uniform value, the segment that corresponds to the uniform input value has to be determined. Then, the inversion is made by applying the interpolation of that segment to the uniform value.

The most common technique to carry out interpolated-segments inversion is to use the same type of interpolation for all the segments. Thereby, the same general inversion function is used for all the segments, with the polynomial coefficients differing for each segment but not the operations. A very important advantage of this method is that the coefficients, for any type of interpolation, do not change as long as the starting and ending points of each segment do not change. This means that they do not have to be calculated at execution time, so we can calculate them in advance and store them in memory tables. At execution time, for each uniform value, the memory tables are accessed to read the coefficients corresponding to its calculation segment, which are then applied to the interpolated inverse function.

The overall characteristics of the inversion by interpolation will depend on the interpolation method, on the selected degree and on the segmentation policy. These features will determine the quality of the inversion (how accurate the approximation is) and how many segments are needed for a given accuracy. There exist several methods for spline interpolation, such as:

• Chebyshev

• Legendre

• Jacobi

• Hermite

Among them, Hermite interpolation stands out due to three facts [HL03]. Firstly, the better results obtained for the same interpolation degree with respect to the other methods, as it takes into account the density of the function. Secondly, Hermite interpolation is a local approximation and its accuracy can be improved just by introducing more segmentation points where needed, without recalculating all the segments. And finally, F^{−1} is a monotonically increasing function, and the monotonicity properties of Hermite interpolation ensure that the interpolation will also be monotonically increasing. One last advantage is that the calculation of the polynomial constants is easier than with the other methods.

2.3.3.3. Quintic Hermite Interpolation Equations

In [HL03] several polynomial degrees of Hermite interpolation (linear, cubic and quintic) are studied. The quintic interpolation obtains the best results in terms of accuracy (almost exact) and number of segments. Therefore, we have selected quintic interpolation to develop a GRNG.

For each segment of the GCDF [p_i, p_{i+1}], where u_i = CDF(p_i), the inverse function is interpolated with a degree-5 polynomial:

H_i^5(ū) = a_{i0} + a_{i1} ū + a_{i2} ū^2 + a_{i3} ū^3 + a_{i4} ū^4 + a_{i5} ū^5    (2.18)

where ū = u − u_i, with u being the uniform variable to invert, u_i the starting point of the segment and u_{i+1} the starting point of the next segment. For each segment the values of the coefficients are [DEH89]:

a_{i0} = p_i    (2.19)
a_{i1} = 1/f_i    (2.20)
a_{i2} = −f'_i / (2 f_i^3)    (2.21)
a_{i3} = (3 f'_i/f_i^3 − f'_{i+1}/f_{i+1}^3) / (2∆u) + (10∆s − 6/f_i − 4/f_{i+1}) / ∆u^2    (2.22)
a_{i4} = (−3 f'_i/f_i^3 + 2 f'_{i+1}/f_{i+1}^3) / (2∆u^2) + (−15∆s + 8/f_i + 7/f_{i+1}) / ∆u^3    (2.23)
a_{i5} = (f'_i/f_i^3 − f'_{i+1}/f_{i+1}^3) / (2∆u^3) + (6∆s − 3/f_i − 3/f_{i+1}) / ∆u^4    (2.24)

where f_i = f(p_i), f'_i = f'(p_i), ∆u = u_{i+1} − u_i, ∆p = p_{i+1} − p_i, ∆s = ∆p/∆u and (F^{−1}(u_i))'' = −f'(p_i)/f(p_i)^3.

2.4. Variance Reduction Techniques

As introduced in Section 2.3.2, variance reduction techniques are essential to reduce the total simulation time required by a Monte Carlo simulation, as they reduce the total number of replications required to obtain the result with a given confidence interval. To reduce the variance between the results of the different replications, these techniques modify the sequence of random numbers provided by the random number generator to cover the space of the underlying variables in a better way than with a pure random sequence. To do so, variance reduction techniques introduce some dependencies in the random numbers. These dependencies must be introduced across replications, since if they were introduced within replications the result obtained would be meaningless. Among these techniques, the following stand out:

• Control Variates.

• Antithetic Variables.

• Stratified Sampling.

• Latin HyperCube.

• Importance Sampling.

An extensive description of these techniques can be found in [Gen98, Gla04]. However, in this Thesis we are going to focus on two of them, Stratified Sampling and Latin Hypercube, as they are the most general of these techniques and can be applied in a similar way to all Monte Carlo simulations.

2.4.1. Stratified Sampling and Latin Hypercube

The stratified sampling technique is based on the idea of better covering the space of a random variate by dividing it into n subsets or strata A_i. Let us assume that f depends on the random variate r. The expected value of f is then calculated according to:

E[f] = ∑_{i=0}^{n−1} E[f(r) | r ∈ A_i] · P(r ∈ A_i)    (2.25)

where E[f(r) | r ∈ A_i] is calculated by Monte Carlo simulations, P() being the probability distribution. The number and definition of the strata are chosen to minimize the variance of the average estimator. If we choose equally likely strata, the expected value is now given by:

E[f] = (1/n) ∑_{i=0}^{n−1} E[f(r) | r ∈ A_i]    (2.26)

In this case, if the needed random samples follow a uniform distribution [0,1), their generation to calculate E[f(r)|r ∈ Ai] is easily carried out by means of the following equation:

u_i^s = i/n + u_i/n,    with i = 0, ..., n − 1    (2.27)

where each i/n is the starting point of each stratum and the uniform variables u_i are scaled to the stratum size.

When r does not follow a uniform distribution, the calculation of E[f(r) | r ∈ A_i] by means of the Monte Carlo method is more involved due to the complexity of sampling r according to its distribution function P[r ∈ A_i].

Stratified sampling can also be extended to multidimensional variates. Each dimension is partitioned into n strata, giving a total number of n^d strata, d being the number of dimensions. This makes the method unfeasible when the number of dimensions is high, as the required number of multidimensional variates also follows this exponential growth.

The Latin Hypercube [Gen98] can be understood as an alternative extension of stratified sampling that does not suffer from the problem of dimensionality. Latin Hypercube consists in stratifying the one-dimensional marginals of a joint distribution instead of stratifying the whole multidimensional space. Therefore, the number of multidimensional variates grows linearly with the number of dimensions. For the sake of illustration we will outline the sampling procedure for a multidimensional independent uniform variate. Let us call Π_0, Π_1, ..., Π_{d−1} d permutations of {0,1,...,n−1} drawn independently, assuming that the n! different permutations are equally likely. Random samples of the multidimensional uniform variate can be obtained according to:

(a) Stratified Sampling (b) Latin Hypercube.

Figure 2.3: Dimension Impact: Stratified Sampling & Latin Hypercube.

v_i^j = (u_i^j + Π_j(i)) / n,    j = 0, ..., d − 1,  i = 0, ..., n − 1    (2.28)

where the u_i^j are independent draws of a one-dimensional uniform random variate.

In Figure 2.3 the difference between both techniques is graphically depicted for two dimensions and four strata. As can be seen, with stratified sampling and d = 2 and n = 4 it is necessary to work in groups of 16 replications, as 16 random variables are required. Meanwhile, with Latin Hypercube only groups of four replications are needed. This feature has a great importance in Monte Carlo simulations of temporal models, where the simulation of each time step requires different random variables. In these cases we need to work with groups of x replications for the total number of time steps, t, requiring x × t random numbers to which variance reduction techniques have been applied. This imposes high memory requirements and will determine how to simulate, see Section 5.3.1.
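As an illustrative software sketch of Equation 2.28 (not part of the original text), the following C routine fills an n × d array of Latin Hypercube samples; the permutations Π_j are drawn with a Fisher-Yates shuffle, and uniform01() is again a placeholder for the base URNG returning values in [0,1).

    #include <stdlib.h>

    extern double uniform01(void);              /* placeholder base URNG    */

    /* Latin Hypercube sampling (Eq. 2.28): fills v[i][j] for the n strata
     * (i = 0..n-1) and d dimensions (j = 0..d-1), stratifying each
     * one-dimensional marginal independently.                              */
    void latin_hypercube(int n, int d, double v[n][d])
    {
        int *perm = malloc(n * sizeof *perm);

        for (int j = 0; j < d; j++) {
            for (int i = 0; i < n; i++)          /* identity permutation    */
                perm[i] = i;
            for (int i = n - 1; i > 0; i--) {    /* Fisher-Yates -> Pi_j    */
                int k = (int)(uniform01() * (i + 1));
                int tmp = perm[i]; perm[i] = perm[k]; perm[k] = tmp;
            }
            for (int i = 0; i < n; i++)          /* v_i^j = (u + Pi_j(i))/n */
                v[i][j] = (uniform01() + perm[i]) / n;
        }
        free(perm);
    }

Note that only n replications per group are needed regardless of d, which is the property exploited in Chapter 5.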

2.5. Developed Gaussian Random Number Generator

2.5.1. Uniform Random Number Generator

Inversion-based RNGs inherit the statistical properties of their base uniform random generator. Thereby, one of the characteristics that a URNG used in a Monte Carlo simulation must fulfill is the high quality of the generated random sequences: good randomness, high period and good statistical properties. Among the generators mentioned in Section 2.2, Rand2, MT and the combination of several basic generators fulfill these characteristics.

As exposed in Section 2.3.2, one of the features we are looking for in the hardware implementation is the generation of one random value per clock cycle, at the highest possible frequency. Rand2 and the combined generators require the currently generated value to compute the next value and, therefore, if one value is generated per cycle, no pipelining can be introduced. This circumstance is not a problem for basic combined generators, as they are composed of very fast bitwise operations, but it discards the use of Rand2, since all the operations to be performed in just one cycle are complex and slow (two multiplications, two modulus operations, two additions, one division and one subtraction).

Therefore we have selected two URNGs to develop and use as base uniform random generators: a combined Tausworthe generator and the MT.

2.5.1.1. FPGA Tausworthe 88

The Tausworthe 88 [L'E96] is a well known URNG which combines three of the basic Tausworthe generators presented in Section 2.2.3.2. The sequence of uniform variables is obtained by combining, using an XOR operation, the three variables obtained from the basic generators. The three basic generators combined in Tausworthe 88 (or Taus88) are three trinomials for sequences of 31, 29 and 28 bits respectively, with full period and the following sets of parameters (k, q, s), see Section 2.2.3.2:

• P 1(31, 13, 12)

• P 2(29, 2, 4)

• P 3(28, 3, 17)

Therefore, as the trinomials used are pairwise relatively prime, the period of the combined generator is 2^{88}.

In software, the recurrence of each of the generators can be quickly calculated; given the vector A holding u_{n−1}, the vector B for temporary storage and the vector C containing a mask composed of k ones followed by L − k zeros, u_n can be calculated in six operations:

1. B ← A left shifted q bits.

2. B ← A XOR B

3. B ← B right shifted k − s bits

4. A ← A&C

5. A ← A left shifted s bits

6. A ← A XOR B

This is further simplified in hardware, due to the possibility of working at bit level, so these six operations are reduced to just an XOR and the concatenation of bits:

U^i_{n+1}(31 : 32−(k−s)) = U^i_n(31−s : 32−k)    (2.29)
U^i_{n+1}(31−(k−s) : 0) = U^i_n(31 : k−s−1) ⊕ U^i_n(31−q : k−s−q−1)    (2.30)

while the complete generator implies two more XOR operations:

U_{n+1} = U^1_{n+1} ⊕ U^2_{n+1} ⊕ U^3_{n+1}    (2.31)
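For comparison with the hardware equations above, a software sketch of the combined step is shown below, using the published taus88 parameter sets (k, q, s) = (31, 13, 12), (29, 2, 4) and (28, 3, 17) from [L'E96]; the masks keep only the k valid state bits of each 32-bit word, and the seed values are arbitrary examples that leave those bits non-zero.

    #include <stdint.h>

    /* Tausworthe 88 (Taus88): three combined basic Tausworthe generators. */
    static uint32_t s1 = 12345, s2 = 67890, s3 = 13579;

    uint32_t taus88(void)
    {
        uint32_t b;
        b  = ((s1 << 13) ^ s1) >> 19;                 /* (k,q,s) = (31,13,12) */
        s1 = ((s1 & 0xFFFFFFFEu) << 12) ^ b;
        b  = ((s2 << 2) ^ s2) >> 25;                  /* (k,q,s) = (29, 2, 4) */
        s2 = ((s2 & 0xFFFFFFF8u) << 4) ^ b;
        b  = ((s3 << 3) ^ s3) >> 11;                  /* (k,q,s) = (28, 3,17) */
        s3 = ((s3 & 0xFFFFFFF0u) << 17) ^ b;
        return s1 ^ s2 ^ s3;                          /* combination, Eq. 2.31 */
    }

In hardware, each pair of shift/XOR lines collapses into the fixed bit selections and single XOR of Equations 2.29 and 2.30.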

2.5.1.2. FPGA Mersenne Twister

In the literature, some previous work focused on the hardware implementation of the MT URNG can be found in [SK06, CA08, TB09]. The MT generator is highly parameterizable and these works focus on the most used configuration, MT19937, due to its high quality and the simplifications that its set of parameters introduces. Hence, we also consider that set of parameters as the ideal one for FPGA. However, none of the previous implementations fulfills all the characteristics we are looking for. [SK06, CA08] present a low frequency and [SK06] does not provide one sample per cycle. Meanwhile, in [TB09] the first part of the algorithm is not implemented in hardware and therefore it is not a complete implementation; it requires a HW-SW communication overhead that rules this implementation out for our purposes. Instead, we are looking for a hardware MT with the following features:

1. All in hardware.
2. Capable of generating one sample per cycle.
3. Efficient in area and performance.

In the following, the general MT algorithm is described, introducing the simplifications due to the MT19937 set of parameters. The algorithm is split into three different tasks:

1. Initialization: the generation, from a seed, of the first n vectors of the recurrence.
2. Obtaining the linear recurrence.
3. The tempering of the generated variables from the linear recurrence.

Firstly, an initialization from the seed is needed, as the linear recurrence requires a work area of n variables. This initialization takes the seed as the first element of the recurrence, x_0, while the other n − 1 variables are generated following the recurrence:

x_i = 1812433253 × (x_{i−1} ⊕ (x_{i−1} >> 30)) + i    (2.32)
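For reference, a software sketch of this initialization for MT19937 (n = 624, w = 32 bits) is shown below; it is a direct transcription of Equation 2.32, and the multiplication by 1812433253 is the operation that maps to the DSP blocks discussed later.

    #include <stdint.h>

    #define MT_N 624

    /* MT19937 work-area initialization from a 32-bit seed (Eq. 2.32):
     * x_0 = seed, x_i = 1812433253 * (x_{i-1} ^ (x_{i-1} >> 30)) + i.     */
    void mt_init(uint32_t x[MT_N], uint32_t seed)
    {
        x[0] = seed;
        for (uint32_t i = 1; i < MT_N; i++)
            x[i] = 1812433253u * (x[i - 1] ^ (x[i - 1] >> 30)) + i;
    }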

Figure 2.4: Mersenne Twister General FPGA Architecture.

Once the first work area, the initial n variables, is obtained, the random numbers are calculated following Equation 2.11:

x_{k+n} = x_{k+m} ⊕ (x_k^u | x_{k+1}^l) A    (2.33)

where x_k^u denotes the w − r most significant bits of x_k, and x_{k+1}^l corresponds to the r least significant bits of x_{k+1}. To make the computation of the multiplication fast, the matrix A is selected in such a way that (x_k^u | x_{k+1}^l) A is reduced to:

(x_k^u | x_{k+1}^l) >> 1            when x_{k+1}(0) = 0    (2.34)
((x_k^u | x_{k+1}^l) >> 1) ⊕ a      when x_{k+1}(0) = 1    (2.35)

where a is the w-th row of matrix A. Therefore, the equation is mostly reduced to XOR operations. Finally, each random number obtained in the linear recurrence is tempered, i.e., modified to improve its statistical properties, by multiplying the random sample by a matrix T, z = x_{k+n} T. Again, matrix T is selected in such a way that this multiplication is simplified to several logical bitwise operations:

y = x_{k+n} ⊕ (x_{k+n} >> u)    (2.36)
y_1 = y ⊕ ((y << s) & b)    (2.37)
y_2 = y_1 ⊕ ((y_1 << t) & c)    (2.38)
z = y_2 ⊕ (y_2 >> l)    (2.39)
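As a software transcription of this tempering step with the MT19937 constants (u, s, b, t, c, l) = (11, 7, 0x9D2C5680, 15, 0xEFC60000, 18), the following sketch shows that only shifts, ANDs and XORs are involved, which is why this stage maps so well to FPGA logic:

    #include <stdint.h>

    /* MT19937 tempering (Eqs. 2.36-2.39): improves the statistical
     * properties of the raw recurrence output with shifts, ANDs and XORs. */
    uint32_t mt_temper(uint32_t x)
    {
        uint32_t y = x ^ (x >> 11);              /* u = 11                  */
        y = y ^ ((y << 7)  & 0x9D2C5680u);       /* s = 7,  b               */
        y = y ^ ((y << 15) & 0xEFC60000u);       /* t = 15, c               */
        return y ^ (y >> 18);                    /* l = 18                  */
    }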

As just seen, the MT URNG depends on multiple parameters (w, n, m, r, a, u, s, b, t, c, l), corresponding in MT19937 to the set (32, 624, 397, 31, 9908B0DF, 11, 7, 9D2C5680, 15, EFC60000, 18).

(a) MT Three port Table. (b) MT Circular Buffer.

Figure 2.5: MT work area Storage.

The hardware implementation of the MT generator has to deal with the three tasks (initialization, recurrence, tempering), while a storage element is needed for the samples composing the work area, as can be seen in Figure 2.4. From the previous equations it can be easily deduced that the linear recurrence and the tempering tasks fulfill the criteria of obtaining one sample per cycle while achieving a high clock rate, as they are mostly composed of just XOR bitwise operations. Furthermore, due to the depth of the work area and the dependencies among samples, the logic of both tasks can be pipelined to increase the clock rate.

Meanwhile, the initialization task also requires a multiplication and an addition. Although this task will not be working once the first n-sample work area is generated, in reconfigurable hardware the slowest task determines the maximum clock rate. Hence, the clock rate of the MT generator is also determined by the more complex and slower initialization logic, and therefore pipelining this task is a must if a high clock rate is desired.

Another important fact to take into account is the storage element. An n-word storage area is needed for the linear recurrence equation and, due to the nature of the linear recurrence, two storage options are suitable, see Figure 2.5: a storage table and a register buffer. In the first case, a three-port table is needed. It can be implemented with two dual-port block RAMs or distributed logic, plus the logic required for the indexes. The three ports required are one for reading x_{k+1}, another for reading x_{k+m} and another for writing x_{k+n}, while x_k is obtained from x_{k+1}, see Figure 2.5(a). In the second case, a buffer of registers can be used, as the relationship between the indexes of the words is fixed and in each step of the recurrence x_k is replaced by x_{k+n} in the work area. In this way, the linear recurrence and the buffer of registers can be considered as a circular buffer where the linear recurrence is some combinational logic between the input and the output of the buffer, see Figure 2.5(b).

Table 2.3: Virtex-4 XC4VFX140-11. Table of resources.

Slices: 63168    DSP blocks: 192    18 Kb BRAMs: 552

Table 2.4: URNG Implementation Results.

                                 Slices   BRAM   DSP   Start Cycles     MHz
Taus88                               77      -     -              1   943.4
MT19937 Logic                        73      -     -              -   646.8
MT19937 Init. 1 Cycle, CB           807      -     3            625   108.2
MT19937 Init. 1 Cycle, Table        161      4     3            625   105.1
MT19937 Init. 2 Cycle, CB           819      -     3           1249   214.5
MT19937 Init. 2 Cycle, Table        158      4     3           1249   185.3
MT19937 Init. 3 Cycle, CB           816      -     3           1873   256.3
MT19937 Init. 3 Cycle, Table        153      4     3           1873   314.2
MT19937 Init. 4 Cycle, CB           816      -     3           2497   339.3
MT19937 Init. 4 Cycle, Table        183      4     3           2497   345.9
[TB09] (Virtex-4 FX100)             128      4     -            624   265

2.5.1.3. Implementation Results

The Xilinx Virtex-4 XC4VFX140-11 will be the reference FPGA for the implementation results of the subcomponents and cores. Meanwhile, for the whole accelerator implementation the reference FPGA will be the Xilinx Virtex-5 XC5VFX200T-2, Chapters 5 and 6. In Table 2.3 a summary of the resources of the Virtex-4 is available.

In Table 2.4, the results for the developed URNGs are summarized. The abbreviations CB and Table refer to the circular buffer and the three-port table MT implementations respectively. Additionally, the results for the most representative FPGA MT in the literature, [TB09], are presented (results from [SK06, CA08] are not included since they are incomplete and their clock frequency is below 40 MHz). The Logic entry of the MT generator comprises the results for just the combinational logic of the linear recurrence and the output tempering (introducing a register between them), without taking into account the logic needed for the initialization or the storage element. Hence, this frequency result represents the maximum clock frequency achievable for an MT with a throughput of one sample per cycle.

In the complete generator, this performance is reduced due to two facts: the above mentioned delay of the initialization logic and the need for storage elements. As can be seen in the table, the clock frequency improves as more pipeline stages are introduced in the initialization, but only up to a certain limit given by the storage elements. For both implementations and four stages, the slowest path is determined by the storage elements. This circumstance especially affects the circular buffer implementation, as the complex routing of the shift register limits the frequency at which the registers are shifted in the buffer. Meanwhile, for the table implementation, the frequency limit is given by the update of the addresses of the table. Tausworthe 88 presents a really high clock frequency for an FPGA, close to 1 GHz, with an almost negligible use of resources.

2.5.2. N(0,1) Gaussian Random Number Generator

Following the discussion in Section 2.3, the inversion-based N(0,1) GRNG with quintic Hermite interpolation for the approximation of the ICDF has been considered the most suited GRNG for a Monte Carlo simulation. Clearly, two different tasks are required for any inversion-based RNG relying on this technique: a first setup stage, where the calculation of the function approximation is carried out, and a second task where the random variables are generated. In the first task, the segmentation of the function range and the calculation of the coefficients for each segment are carried out. In the second task, the random variables are generated from uniform random variables:

1. Generation of a uniform random variable u.
2. Search for the segment corresponding to u.
3. Extraction of the coefficients for that segment.
4. Calculation of GCDF^{−1}(u) applying H_i^5(ū).

This methodology fits very well with a hardware architecture. As long as no changes are made in the segmentation policy or in the type of interpolation used, the setup stage always produces the same results (the polynomial coefficients and the starting points, u_i, of each segment). These are the data required by the generator and, as they remain unchanged, they can be stored in tables as ROM data. Hence, the setup stage only needs to be computed once. Furthermore, there is no need to do it in the hardware RNG. To simplify the complexity of the RNG, the interpolation can be carried out on a software platform which is also in charge of generating the necessary text files in a hardware description language containing the ROM tables with the polynomial coefficients and the segments' starting points.

A software implementation of a GRNG with Hermite interpolation has been previously studied by Hörmann et al. in [HL03] and can be found in the software tool UNU.RAN [LH02]. We have used some of the framework provided by this tool for developing our hardware GRNG.
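As an illustrative software sketch of steps 2-4 above for a single uniform input (not the actual hardware description), assume the setup stage has produced the arrays seg_start[] (the u_i) and coef[][6] (the a_{i0}..a_{i5}); both names are hypothetical. The binary segment search and the Horner evaluation shown here have direct hardware counterparts: an address decoder plus ROM tables, and a multiply-add chain.

    /* Inversion by interpolated segments: given a uniform u in (0,1), find
     * its segment, fetch the quintic Hermite coefficients and evaluate
     * H_i^5(u - u_i) with Horner's scheme (Eq. 2.18).                      */
    double invert_icdf(double u, int n_seg,
                       const double seg_start[], const double coef[][6])
    {
        /* Binary search for the segment i such that u_i <= u < u_{i+1}.    */
        int lo = 0, hi = n_seg - 1;
        while (lo < hi) {
            int mid = (lo + hi + 1) / 2;
            if (seg_start[mid] <= u) lo = mid;
            else                     hi = mid - 1;
        }

        double ubar = u - seg_start[lo];          /* local coordinate       */
        const double *a = coef[lo];

        /* Horner evaluation of the degree-5 polynomial.                    */
        return ((((a[5] * ubar + a[4]) * ubar + a[3]) * ubar + a[2]) * ubar
                + a[1]) * ubar + a[0];
    }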

2.5.2.1. Base Software Tool: UNU.RAN

The UNU.RAN tool was developed by Hörmann and Leydold to provide a software tool capable of automatically generating software random number generators for several distribution functions, using the inversion method with Hermite interpolation as generation method. Additionally, in [HL03] the theoretical framework of the tool is explained and the results obtained for several distributions, such as Gaussian, exponential, Cauchy, etc., with different interpolation degrees (linear, cubic and quintic) and approximation error bounds are provided. These results demonstrate that quintic interpolation provides the best results, both in the error bounds that can be achieved and in the number of segments required to obtain those bounds.

Figure 2.6: UNU.RAN segmentation.

2.5.2.1.1. Segmentation

In the UNU.RAN implementation, the segmentation algorithm depends on the accuracy. The desired accuracy is a configurable parameter that determines when a segment is accepted as definitive. In UNU.RAN the accuracy is related to the x axis of the inverse function (the y axis of the CDF) and is defined as the maximum error allowed on that axis.

ε̂_u = |CDF(H_i^n(u)) − u|

where u is the uniform value and n the interpolation degree. The input uniform value is compared with the CDF of the result of inverting u. As the CDF is an exact function, this accuracy policy also ensures the accuracy of the inversion.

Segmentation is done taking into account the whole range of the inverse function, [0,1], and of the CDF, (−∞, ∞). In addition to the accuracy, two more parameters are required for the segmentation: the maximum size of each interval, measured as u_i − u_{i-1}, and the probability to be chopped in the tails of the CDF. Several functions, such as the Gaussian N(0,1), spread to ±∞. However, no interpolation can be done when reaching ±∞, while the probability of these tails is insignificant. Therefore, the interpolation can be restricted to a much smaller range (the range that concentrates almost all the probability). In this way, the starting and ending points are selected by configuring the probability we want to chop.

Once the CDF range is delimited, the segmentation is done according to the selected accuracy and the maximum segment size by an iterative method, see Figure 2.6. In each pass, the segmentation algorithm considers the segment corresponding to all the CDF range that has not been segmented yet (initially all the CDF minus the chopped tails). If the segment is bigger than the maximum segment size, it is halved as many times as necessary until it is smaller than the maximum size. Then, the obtained segment is interpolated and the coefficients of the interpolation are calculated. To check whether the segment fulfills the selected accuracy, the function is evaluated at the middle point of the segment (on the uniform axis) using the coefficients calculated for the segment. The middle point is selected because the interpolation is exact at the endpoints, so the largest error is expected in the middle of the segment. While the error measured at the middle point is bigger than the maximum error allowed, the right endpoint of the calculation segment is modified in order to divide the calculation segment by two. In this way the calculation segment gets smaller and smaller as its interpolation becomes more accurate, until the error at its middle point is below the maximum error allowed. At this point the segment is considered a valid segment, and the segmentation continues with the rest of the range.
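The iterative procedure can be sketched in a few lines of C. To keep the sketch self-contained it segments the exponential ICDF, F^-1(u) = −ln(1 − u), with linear interpolation and an absolute error test instead of the Gaussian ICDF with quintic Hermite interpolation; the chopped range, maximum segment size and error bound are purely illustrative values.

    #include <stdio.h>
    #include <math.h>

    static double icdf(double u) { return -log(1.0 - u); }  /* stand-in ICDF */

    int main(void)
    {
        const double u_max   = 0.99;   /* range left after chopping the tail  */
        const double max_len = 0.05;   /* maximum segment size (uniform axis) */
        const double eps_max = 1e-3;   /* error allowed at the middle point   */
        double left = 0.0;
        int nseg = 0;

        while (left < u_max) {
            /* candidate: all the range not segmented yet, halved until it is
             * no larger than the maximum segment size                        */
            double right = u_max;
            while (right - left > max_len)
                right = left + (right - left) / 2.0;

            /* shrink the candidate until the mid-point interpolation error
             * (here: linear interpolation) is below the accuracy bound       */
            for (;;) {
                double mid    = 0.5 * (left + right);
                double approx = 0.5 * (icdf(left) + icdf(right));
                if (fabs(approx - icdf(mid)) <= eps_max)
                    break;
                right = left + (right - left) / 2.0;
            }
            nseg++;          /* accept the segment                            */
            left = right;    /* and continue with the rest of the range       */
        }
        printf("%d segments\n", nseg);
        return 0;
    }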

2.5.2.1.2. Search algorithm and generation of gaussian variables

The first task to transform a uniform variable, u, into a Gaussian one is finding the segment to which u belongs. The Gaussian inverse CDF is monotonically increasing, and so is the resulting segmentation, so ordered search methods can be used. In this case, an indexed search is used, with an index table which contains pointers to the table containing the starting points of the segments, the search table. With the u value an address to the index table is obtained, which returns a pointer to the search table. This pointer points to the correct segment or to some close segment below it, with u_i < u. If the pointed segment is not the correct one, the pointer is increased by one as many times as needed until the correct segment is reached. Finally, once the corresponding segment is found, the coefficients and the starting point of the segment are extracted and the Gaussian variable is obtained following Equation 2.18.

2.5.2.2. Hardware GRNG

The software UNU.RAN implementation presents a challenge for a hardware implementation given the features we have defined for a Monte Carlo GRNG. The UNU.RAN search algorithm is multicycle, with a variable number of cycles required for each search. This is in clear conflict with the requirement of one random sample per cycle at a high clock frequency, as a variable number of cycles implies that the throughput will no longer be one sample per cycle. Therefore, developing a hardware-oriented search algorithm that can provide one search per cycle is a must for the hardware implementation, see Section 2.5.2.2.3. Additionally, multiple modifications can be introduced to the UNU.RAN method to adapt it to hardware:

1. Segmentation and architecture oriented to hardware URNGs, which typically provide uniform samples as 32-bit integers.

2. Accuracy and architecture oriented to the desired output arithmetic.

3. Simplification of equations to reduce the number of operators and the resources used.

4. Tailored internal arithmetic.

In the following sections our floating-point GRNG will be presented, and these modifications, together with other improvements and changes, will be explained. To develop the hardware GRNG the following steps have been carried out:

1. Adaptation of the segmentation algorithm to hardware GRNG.

2. Analysis of the segmentation results and of their impact on the architecture.

3. Development of a hardware oriented search algorithm.

4. Architecture design and general improvement of the setup stage in order to simplify the architecture.

2.5.2.2.1. Accuracy-Adaptive Segmentation Policy

Given a desired accuracy, the range of a function can be segmented in two different ways according to two objectives: to obtain the smallest number of segments, or to achieve a segmentation that is efficient in terms of the segment search. As explained before, the segment search can compromise the performance of the GRNG, given that search algorithms are usually multicycle. To avoid this problem, the segmentation points can be chosen in such a way that the segment search is restricted to analyzing the value of some bits of the uniform variable. One example is the hierarchical segmentation used in [LCVL06]. However, selecting the segmentation points based on their search easiness has the negative effect of increasing the number of segments needed to achieve the desired accuracy, and for very high accuracies the number of segments becomes too large.

To obtain the lowest possible number of segments (and therefore the smallest tables), an accuracy-adaptive segmentation method, based on the iterative one employed in [LH02], has been used. Our method introduces several modifications to the original one. First, we have adapted the method to the uniform values used in the hardware architecture, 32-bit integer values (from 0x00000001 to 0xFFFFFFFF; the zero value is not considered, as F^-1(0) = −∞), instead of double-precision floating-point values. Hence, the segments' initial points will always match a uniform number of the hardware range and will allow a search algorithm based on integer search. Second, we have changed how the accuracy related to the interpolation error is measured: from an absolute error measured at the middle point on the uniform axis, to a relative error on the Gaussian axis using direct inversion [Mor83]. In this way, the true error is measured and not approximated as before. By using a relative error we can ensure that, for all segments, all generated floating-point Gaussian variables have the same number of accurate mantissa bits.

With this method we obtain non-uniform, non-hierarchical segments adapted to the desired accuracy (number of accurate mantissa bits).
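As a small illustration of the relative-error criterion, the helper below converts the relative error of an approximated value (here, a value that would come from the interpolation, compared against a direct-inversion reference) into a number of accurate mantissa bits; the sample values are arbitrary.

    #include <math.h>
    #include <stdio.h>

    /* Accurate mantissa bits of x_approx with respect to a reference x_exact,
     * obtained from the relative error.                                       */
    static double accurate_bits(double x_approx, double x_exact)
    {
        double rel = fabs(x_approx - x_exact) / fabs(x_exact);
        return -log2(rel);
    }

    int main(void)
    {
        /* a relative error of 2^-22 corresponds to about 22 accurate bits */
        printf("%.1f accurate bits\n", accurate_bits(1.0 + 0x1p-22, 1.0));
        return 0;
    }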

2.5.2.2.2. Segmentation and Coefficient Analysis

The analysis of the results obtained for both the original segmentation method and ours has been carried out using double-precision floating-point arithmetic and different target error bounds. This analysis has focused on three aspects: accuracy, number of segments and value of the coefficients.

With respect to the first two aspects, the analysis proved that the final accuracy obtained (for a reasonable number of segments) is limited by the interpolation at the boundaries of GCDF^-1, where the function tends to infinity, and that its value can be represented using single-precision floating-point arithmetic. Hence, this precision can be adopted without significant accuracy loss. At the boundaries, GCDF^-1 is hard to interpolate and the last segments are very small, containing only one of the possible uniform values.

Regarding the coefficient values, it has to be considered that GCDF^-1 is an odd function around 0.5, so the Gaussian random variables generated from u and 1 − u only differ in their sign. In this way, only half of GCDF^-1 needs to be segmented to implement the inversion. The coefficient values obtained can be clearly differentiated between the segments before and after 0.5. All segments before 0.5 have negative coefficients, while for the segments after 0.5 only the segment starting at 0.5 (a single coefficient very close to zero) and the segments at the GCDF^-1 boundary (those containing only one possible uniform value) have negative coefficients. Additionally, considering single precision, all the coefficients are normalized numbers and their multiplication by the minimum ū value, 2^-32 (zero is not considered), also yields normalized numbers.

Negative coefficients imply subtractions in the polynomial, requiring adder/subtracters, which are more expensive than adders. Thus the upper half of GCDF^-1 has been selected to implement the inversion and its negative coefficients have been eliminated. Replacing the negative coefficient of the segment starting at 0.5 by a zero has no consequences on the global accuracy (its value is very close to zero and much smaller than the other segment coefficients), while the segmentation at the boundary has been recalculated with linear interpolations including two uniform values. In this way, since GCDF^-1 is monotonically increasing, the interpolations of these segments have only positive values (a_i0 and a_i1, the rest being zero), and the accuracy at the boundary is improved while the number of segments is reduced.
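A minimal sketch, in 32-bit fixed point, of how this symmetry can be exploited: samples below 0.5 are folded into the implemented upper half and the sign of the resulting Gaussian sample is flipped (this anticipates the Fold Unit of the architecture; names are illustrative and u = 0 is assumed to be excluded by the URNG).

    #include <stdint.h>

    /* Fold a 32-bit uniform sample into [0.5, 1) using the odd symmetry of
     * GCDF^-1 around u = 0.5; the sign is applied to the final sample.        */
    static uint32_t fold(uint32_t u, int *sign)
    {
        if (u & 0x80000000u) {      /* msb set: u already lies in [0.5, 1)     */
            *sign = +1;
            return u;
        }
        *sign = -1;                 /* u in (0, 0.5): map it to 1 - u, which   */
        return ~u + 1u;             /* in fixed point is bitwise negation + 1  */
    }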

2.5.2.2.3. Search Unit. Search Algorithm

The search algorithm we have developed is the key factor for obtaining one sample per cycle and high throughput. A non-uniform, non-hierarchical segmentation makes some kind of hardware-adapted search algorithm necessary to overcome the multicycle search of software algorithms.

The direct adaptation of a software search algorithm would have a negative impact on the FPGA implementation, because it requires multiple accesses to a search table. Handling multiple accesses requires either stalls, compromising the pipeline throughput of one sample per cycle, or the duplication of search tables (as many as the maximum number of searches needed) through different pipeline stages, increasing the resources used.

Taking as basis the search algorithm of UNU.RAN, and taking advantage of the characteristics of current FPGAs, whose dual-port RAM memory blocks allow two different positions to be read simultaneously, a pipelined hardware-oriented search method relying on an index table and a search table has been developed. The idea is simple: we construct the index table in such a way that, for any u, the obtained pointer to the search table always points to the correct segment or to the one just below it. In this way, and thanks to the availability of dual-port RAMs in FPGAs, each search can be finished with just one access to the index table to obtain the pointer and another one to the search table, reading simultaneously the segment starting points at pointer and pointer+1. A subsequent comparison of the searched value with the segment starting point at pointer+1 determines to which of the two segments the searched value belongs.

To obtain the desired index table (in the setup stage), a local search scheme based on fixed-point arithmetic and multiple local index tables was developed (the searched variables are 32-bit fixed-point variables). The value range of the searched variables is divided into several parts, each one with its own local index table that ensures a single-access search. The range division into local tables is done in an iterative way:

1. Select a subset of bits from u that can be used as an address.
2. With that subset, build a local table until a value is found that would point to a segment that is neither the correct one nor the one below it.
3. Set the range up to that incorrect value as the range covered by that local table.
4. Update the range still to be divided by subtracting the range just covered.
5. Repeat the previous steps until the whole range is indexed by local tables.

A global index table comprises all the local index tables. To obtain the correct address to the global table, a bit comparison scheme is used. First, the input u is compared against the values that define the boundaries between the local tables. Once it is determined to which local table u belongs, the address is formed by adding the position at which the local table starts within the global one to the corresponding subset of u bits.
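The following C model illustrates the resulting query under simplifying assumptions: a single global index table addressed with the upper bits of u, toy segment boundaries chosen so that every index bucket overlaps at most two segments, and a sentinel entry closing the search table (the real design uses several local index tables, but the one index read, one dual-port table read and one comparison per query are the same).

    #include <stdint.h>
    #include <stdio.h>

    #define INDEX_BITS 4
    #define N_SEG      8

    /* Toy segment starting points (32-bit uniform integers) plus a sentinel. */
    static const uint32_t seg_start[N_SEG + 1] = {
        0x00000000u, 0x20000000u, 0x50000000u, 0x70000000u,
        0x90000000u, 0xC0000000u, 0xD8000000u, 0xF0000000u,
        0xFFFFFFFFu
    };
    static uint32_t index_table[1u << INDEX_BITS];

    static void build_index(void)      /* setup stage */
    {
        for (uint32_t b = 0; b < (1u << INDEX_BITS); b++) {
            uint32_t start = b << (32 - INDEX_BITS);
            uint32_t p = 0;
            while (p + 1 < N_SEG && seg_start[p + 1] <= start)
                p++;
            index_table[b] = p;        /* correct segment or the one below it */
        }
    }

    static uint32_t find_segment(uint32_t u)
    {
        uint32_t p  = index_table[u >> (32 - INDEX_BITS)]; /* index table read */
        uint32_t up = seg_start[p + 1];    /* dual-port read of p and p+1      */
        uint32_t s  = (u >= up) ? p + 1 : p;               /* one comparison   */
        return (s < N_SEG) ? s : N_SEG - 1;    /* forced last-segment address  */
    }

    int main(void)
    {
        build_index();
        printf("u=0x68000000 -> segment %u\n", find_segment(0x68000000u));
        return 0;
    }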

2.5.2.2.4. Architecture

The architecture needed for an inversion-based RNG is basically the same regardless of the generated distribution. Only one modification can be introduced. As explained in Section 2.5.2.2.2, when the desired distribution is an odd function with respect to its middle point, only half of the CDF^-1 has to be implemented. In this way we can halve the number of segments required and consequently the resources required for storing the tables.

Figure 2.7: Inversion N(0,1) RNG architecture.

Figure 2.8: Search unit.

In these cases, some logic must be introduced to transform the uniform random samples from the range (0,1) to the range [0.5,1). Figure 2.7 shows the architecture of our N(0,1) RNG. As can be observed, the 5-degree polynomial is calculated with Horner's rule¹, reducing the calculation of the polynomial to five multiplications and additions. Apart from the polynomial calculation and the base URNG, the architecture is composed of four main units. The Fold Unit represents the extra logic to transform the uniform samples to the range [0.5,1) (bitwise negation plus '1'). The segment corresponding to the uniform variable is obtained in the Search Unit, Figure 2.8, which implements the algorithm explained in Section 2.5.2.2.3. Finally, a Format Unit transforms the fixed-point representation of ū into a tailored floating-point representation. The use of standard floating-point arithmetic instead of fixed-point arithmetic introduces a very heavy computational overhead, see Chapter 3. However, in the implemented N(0,1) RNG, infinities, NaNs or denormalized numbers can never be produced due to the coefficients we have computed and the operations involved. Hence, the floating-point arithmetic needed is greatly simplified, as it only needs to handle normalized numbers and zeros (handling denormalized numbers accounts for most of the prenormalization and normalization logic) and the adders do not need the logic required for subtraction.

¹ p(x) = a0 + x(a1 + x(a2 + x(...)))

Table 2.5: Accuracy and Segmentation.

  Searched Accuracy    21    31    34     44      47
  Obtained Accuracy    20    20    19     24      20
  Segments            185   464   779   3939   11519

Table 2.6: Maximum segment size vs. number of segments (searched accuracy: 21 mantissa bits).

  Segment size (bits)   28    27    26    25    24    23
  Segments             113   112   122   147   202   327

Finally, the floating-point representation is also based on the arithmetic used for ū = u − u_i. ū is one of the operands of the five multipliers, and the way it is represented significantly affects the resources used by the pipeline and the logic needed in the multipliers. Thereby, the tailored floating-point format is selected according to the range of values of ū, whose maximum value is determined by the largest segment (equal to or smaller than the maximum segment size selected for the segmentation) while its minimum value is zero.

2.5.2.3. GRNG Implementation Results

The target accuracy in the setup stage has been 21 mantissa bits (22 bits of accuracy counting the floating-point hidden bit). This accuracy has been determined by experimental trials with our modified software but with the original maximum segment size, 0.05 in the u range; see Table 2.5, where the searched and obtained accuracies are given in mantissa bits.

The table shows some representative accuracy results. Although the searched accuracy is increased, the results show that, while the biggest part of the range fulfills the searched accuracy, there are always several segments close to the tail where the accuracy diminishes to around 20 mantissa bits. This is because the accuracy-driven segmentation algorithm measures the accuracy at the middle point of a segment, but the approximation error can be slightly larger at points close to the middle. According to these results, and as the accuracy obtained is in the order of single precision, we have also decided to use single-precision coefficients instead of double-precision ones, given the impact of double precision on FPGA resources and performance; the accuracy obtained remains of the same order, 20 mantissa bits. In Table 2.6 the segmentation results for 21 mantissa bits, single-precision segmentation and several maximum segment sizes, now measured in bit-widths, are summarized.

To take advantage of the maximum capacity of FPGA Block RAMs (256 32-bit words for a Virtex-4), while improving the segmentation and minimizing the number of bits needed to represent ū in the hardware architecture, we have selected 24 bits as the maximum segment size. Additionally, the segmentation has been expanded to 256 segments. The extra segments come from the use of linear interpolations at the extremes to ensure the accuracy in the tails.

With this segmentation the segment search is split in two. The function range is easily approximated in most of the range, starting at 0.5, so all the first segments have the maximum segment size. Meanwhile the last segments, the ones belonging to the tails of the function, comprise just two points with a linear interpolation. In both cases, a direct search based on some bits of the numbers is possible (forcing the first and the last segment search):

Logic search:
• Segment 0: direct search < 0x80000002 → address 0
• Segments 1-126: direct address → '0' & bits(30-24) + 1
• Segments 178-254: direct address → bits(8-1)
• Segment 255: forced address → address 254

Meanwhile, for the remaining range the developed search algorithm is employed with local index search tables of 128 positions each:

• Local table 1: Segments 127-144. Search bits (25-19) • Local table 2: Segments 145-160. Search bits (20-14) • Local table 3: Segments 161-176. Search bits (14-8) • Local table 4: Segments 177-182. Search bits (9-3)

Another inversion-based GRNG can be found in previous work [LCVL06]. This GRNG is based on a degree-two spline approximation with Chebyshev coefficients and a hierarchical segmentation of the function range. However, this GRNG is not suited for Monte Carlo simulations: it does not ensure very high accuracy due to the use of 16-bit fixed-point arithmetic and the selected interpolation degree and coefficients. Additionally, the 16-bit fixed-point representation rules it out for a wide range of Monte Carlo applications. In Table 2.7, we reproduce the results of [LCVL06] together with those of our GRNG. [LCVL06] was implemented on a Xilinx Virtex-II XC2V4000-6, so the results of our GRNG are also given for that FPGA (VX-II GRNG).

The combination of a higher interpolation degree and the use of Hermite coefficients and floating-point arithmetic allows the CDF^-1 to be approximated much more accurately: in particular, seven more bits of absolute accuracy² with respect to [LCVL06] in our worst case, while in terms of relative accuracy³ the improvement achieved is more than five orders of magnitude.

² Maximum error value between GCDF^-1 and the polynomial interpolation.
³ Maximum percentage error in a Gaussian variable.

Table 2.7: FPGA N(0,1) RNG results.

                Slices   Block   DSP    Clock    Throughput    Max    Absolute      Relative    Sample
                         RAM     Mult   [MHz]    [MSample/s]   σ      Accuracy      Accuracy    Format
  Lee [LCVL06]    585    1        4     231.0    231.0         8.2    0.3 × 2^-11   30%         fixed 16b
  VX-II GRNG     1954    5       20     179.4    179.4         6.23   0.5 × 2^-18   0.000047%   32b single f.p.
  VX-4 CDF^-1    1684    5       20     236.1    236.1         6.23   0.5 × 2^-18   0.000047%   32b single f.p.
  VX-4 GRNG      1757    5       20     220.8    220.8         6.23   0.5 × 2^-18   0.000047%   32b single f.p.


Figure 2.9: Variance reduction general hardware architecture.

However, the improved accuracy comes at the cost of an increase in resources: the tailored single-precision floating-point arithmetic used instead of the fixed-point one, the higher interpolation degree and the generation of 32-bit samples (24 bits of mantissa) instead of 16-bit ones are the reasons for the increase with respect to [LCVL06]. In particular, the difference between interpolation degrees seriously impacts the resource usage, as each additional degree implies one multiplier (4 DSPs), half a Block RAM and one adder. The use of samples with more bits and the complexity of floating-point arithmetic also explain the loss of speed and throughput (around 20%). However, this is a very small performance penalty considering the accuracy improvement, as the floating-point complexity is handled efficiently with a very deeply pipelined architecture of 48 stages.

2.5.3. Stratified Sampling and Latin Hypercube

The devised hardware architecture for both Stratified Sampling and Latin Hypercube is composed of two main parts. The first part is the control unit, which controls the generation of the stratas in the form of integers corresponding to the strata, while the second part involves the arithmetic operations needed to transform uniform random variables into stratified ones. A simplified scheme of the architecture is depicted in Figure 2.9. Besides the control unit, two fixed-point operators, an adder and a divisor, are needed to accomplish the stratification of the random variable:

u_i^s = i/n + u/n ;   where i = 0, ..., n − 1    (2.40)
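A software model of Equation 2.40 with the zero value excluded (the replacement logic discussed next) can be as simple as the sketch below; the number of stratas and the base generator are illustrative.

    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        const int n = 8;                 /* number of stratas */
        srand(42);
        for (int i = 0; i < n; i++) {
            /* base uniform sample in (0,1), never exactly zero */
            double u  = (rand() + 1.0) / ((double)RAND_MAX + 2.0);
            double us = ((double)i + u) / n;     /* Eq. 2.40: i/n + u/n    */
            printf("stratum %d: %f\n", i, us);   /* lies in [i/n, (i+1)/n) */
        }
        return 0;
    }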

Figure 2.10: Stratified Sampling Control Unit.

Additional logic is needed to handle the generation of a zero. To preserve the symmetry of some distributions (such as the Gaussian), and also because a zero can lead to −∞ when using the inverse generation method with unbounded distributions (as the Gaussian), the zero value has to be removed in order to obtain uniform random variables in (0,1). In this case, an alternative uniform variable belonging to the same strata as the zero has to be generated. One important feature for any architecture implementing a variance reduction technique is its configurability, to allow simulations with different requirements in the number of stratas, n, or the number of dimensions, d. To achieve this, the whole architecture is parameterizable by selecting a maximum value for both n and d.

2.5.3.1. Stratified Sampling. Control

The control unit deals with the generation of the strata starting points. One key factor has to be considered for this generation: usually each simulation does not use only one random variate, but rather a group of them. In this case, stratified sampling has to be applied to each of the random variates used, across a group of simulations. This fact introduces a new requirement: the generation of u_i^s will require randomness in the selection of the strata starting points, so as to obtain non-ordered stratified samples. If an ordered generation is used, all the random variates for the same simulation will belong to the same strata and the results will be distorted. A combination of a URNG with a shuffle technique over a memory table can effectively handle this requirement. While the strata starting points can be managed as a uniform distribution between 0 and n − 1, because they are scaled later dividing by n, the shuffle technique introduces the randomness in the strata starting points. The shuffle technique is applied to the integer values stored in the memory table by swapping pairs of values. The randomness is introduced in the selection of one of the memory positions to be swapped, while the other one is determined by a counter that ensures that all memory positions are swapped at least once, following the algorithm below:

    for (i = n-1; i >= 0; i--) {
        addr_fixed  = i;
        addr_random = U(0,1) * (i + 1);   /* random position in [0, i] */
        swap(table[addr_fixed], table[addr_random]);
    }

Figure 2.11: Latin Hypercube Control Unit.

The hardware implementation of this algorithm is shown in Figure 2.10. The shuffle algorithm is implemented in a table which starts with shuffle_table[i] = i. When the strata counter reaches zero, the algorithm is completed and the generated random order is copied to a second table. The use of this second table allows the generated order to be read from it while the next random order is being generated in the shuffle table. With this procedure, one strata starting point can be read per cycle, after an initial latency of n cycles, in a pipelined implementation.

2.5.3.2. Latin Hypercube Extension. Control

The Latin Hypercube control works in a similar way but with one pair of tables per dimension, as can be seen in Figure 2.11. The shuffle technique is applied sequentially among the shuffle tables, generating the random order of each dimension after the random order of the previous dimension is finished. When both counters, for strata and dimension, reach zero, the generated orders are copied to their corresponding reading tables. Reading from these tables is also sequential, but now reading is carried out across all the tables, since we are generating a multidimensional variable: the same memory position is read sequentially for all the dimensions from their corresponding reading table. In this way, after d cycles, the d strata starting points needed for the uniform variables of a multidimensional variable will have been generated. With this type of control, as in stratified sampling, one uniform variable is generated per cycle in a pipelined architecture. Now the initial latency needed to generate the first random order of each dimension is n × d cycles.
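A software model of the resulting sampling pattern, with one independently shuffled permutation of the strata indices per dimension and one stratum read per dimension for every multidimensional sample, is sketched below (table sizes and seed are illustrative).

    #include <stdio.h>
    #include <stdlib.h>

    #define N 8        /* stratas per dimension */
    #define D 3        /* dimensions            */

    static void shuffle(int *t, int n)        /* software shuffle of one table */
    {
        for (int i = n - 1; i >= 0; i--) {
            int j = rand() % (i + 1);
            int tmp = t[i]; t[i] = t[j]; t[j] = tmp;
        }
    }

    int main(void)
    {
        int perm[D][N];
        srand(7);
        for (int k = 0; k < D; k++) {         /* one permutation per dimension */
            for (int i = 0; i < N; i++) perm[k][i] = i;
            shuffle(perm[k], N);
        }
        for (int j = 0; j < N; j++) {         /* sample j reads position j of  */
            for (int k = 0; k < D; k++) {     /* every dimension's table       */
                double u = (rand() + 1.0) / ((double)RAND_MAX + 2.0);
                printf("%.4f ", (perm[k][j] + u) / N);
            }
            printf("\n");
        }
        return 0;
    }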

Table 2.8: Variance Reduction implementation Results

            Sb=1          Sb=2          Sb=3          Sb=4           Sb=5           Sb=6
          Slc.   MHz    Slc.   MHz    Slc.   MHz    Slc.    MHz    Slc.    MHz    Slc.    MHz
  SS       209  321.3    300  321.3    440  321.3    385   318.6    631   277.8   1268   244.6
  LH-1     219  321.3    326  321.3    470  321.3    535   274.2   1125   248.0   1851   200.1
  LH-2     232  321.3    368  321.3    553  274.2    929   245.1   1624   203.2   3390   174.0
  LH-3     255  321.3    431  274.2    785  244.6   1285   201.0   2938   175.7   6446   149.7

2.5.3.3. Implementation Results

The results obtained for both techniques are summarized in Table 2.8 and are graphically depicted in Figure 2.12 with respect to the strata bit-width, Sb, for both techniques, and with respect to the dimension bit-width for the Latin Hypercube (denoted as LH-Db, where Db is the dimension bit-width). For Sb > 3 a DSP is required for the multiplier, which was previously implemented in logic. These results are determined by the slices required for the tables. These tables must be implemented in logic, as the use of Block RAMs is not possible: the shuffle table must be completely copied to the read table in just one cycle to allow a throughput of one sample per clock cycle. The number of words in the tables increases exponentially with the strata and dimension bit-widths, while the number of bits per word increases linearly with the strata bit-width. Initially, the tables are very small and almost half of the logic is due to the divisor (101 slices, 48.5% in the worst case, stratified sampling with stratas of only one bit). However, as the bit-widths increase, the use of resources is mainly due to the tables and the divisor resources become negligible in comparison. For LH-3 and Sb = 6, the divisor represents just 3.4% of the LH unit, 221 slices, while the whole LH unit requires almost 10% of the FPGA.

With respect to the working frequency, the same circumstance happens, as can be seen graphically in Figure 2.12. Initially the slowest path is due to the divisor, but for tables of more than eight positions the access to read and write the shuffle table becomes the slowest stage, and the speed of this access decreases as the table length grows.

With respect to the latency, before a throughput of one sample per cycle is achieved there is an initial latency of n cycles (stratified sampling) or n × d cycles (Latin Hypercube), needed to generate the first random order, plus 32 cycles introduced by the pipelined divisor and one additional cycle for the rest of the logic.

Figure 2.12: Stratified Sampling and Latin Hypercube results.

Figure 2.13: Inversion based GRNG with Variance Reduction technique.

2.5.4. Complete GRNG and SW-HW comparison

The architecture of the GRNG with the variance reduction technique is shown in Figure 2.13. As can be seen, employing variance reduction techniques with an inversion-based generator can be done directly, just by introducing the variance reduction technique between the URNG and the inverse function. Only some control issues must be handled between the URNG and the variance reduction control, see Section 5.4.5.2. Hence, it is just like replacing one URNG with another, only that now the URNG has more logic and complexity; the inverse function simply transforms the uniform random samples into samples of the target distribution, preserving their statistical features. How these uniform samples are generated is irrelevant to the inverse function module.

In Table 2.9 (Sb stands for the strata bit-width and Db for the dimension bit-width) the results of the complete generator are shown for several variance reduction implementations, using the MT generator as the base URNG and a basic Tausworthe URNG for the strata shuffling technique. As expected, for a high number of table positions the clock frequency of the generator decreases while the use of slices grows significantly, up to 13% of the FPGA.

Table 2.9: Complete N(0,1) RNG results.

  Sb   Db   Table size   Slices   Block RAMs   DSP Mult   Clock [MHz]   % FPGA
  1    0    2            2107     9            23         220.8         11% (DSP)
  4    2    64           2712     9            24         220.8         12% (DSP)
  6    2    256          5249     9            24         174.0         12% (DSP)
  6    3    512          8304     9            24         149.7         13% (Slices)

Table 2.10: Hardware-Software Comparison

                  URNG                      N(0,1) CDF^-1           VR                 GRNG
                  Rand2   Taus88   MT       Direct     Hi^5        (4,2)    (6,3)     MT      MT-(4,2)   MT-(6,3)
                                            [Mor83]
  SW MSS          38.3    77.9     65.4     3.8        15.9        33.9     35.5      13.0    9.1        9.0
  HW MSS          -       943.4    345.9    -          236.1       321.3    149.7     220.8   220.8      149.7
  Speed Up (x)    -       12.1     5.3      -          14.8        9.5      4.3       17.0    24.2       16.7

2.5.4.1. Summary of Throughput Results. Hardware and Software Comparison

A hardware-software comparison for all the previous modules is provided in Table 2.10, using an Intel i7 920 at 2.67 GHz as counterpart. 10^9 samples are generated for the URNG and 10^8 samples for the other modules. Hardware and software throughput is measured in number of samples generated per second (mega samples per second, MSS).

With respect to the results obtained for the URNG, a first difference should be emphasized: the different cost of the operations in hardware and software, and the control cost in software. As previously mentioned, Rand2 was not suitable for the FPGA: it requires arithmetic operations which imply a slow datapath, since pipelining Rand2 is incompatible with achieving one sample per clock cycle. However, in software the Rand2 throughput is only half that of Tausworthe 88. The first reason for such a small difference is that the microprocessor ALUs work at a fixed clock rate even though bitwise operations could be executed much faster, which harms their relative performance. A second reason is that the control cost (a for loop executing the URNG as many times as samples are required) is a common overhead for the three URNGs. On the other hand, in the FPGA, the less logic there is in the combinational path between two registers, the lower the delay of that path. Hence, Tausworthe 88 presents the highest clock rate, as it only implies two XOR operations in the FPGA path, achieving almost one gigasample per second with a speedup factor of 12.1 with respect to software. Meanwhile, for the MT (4-cycle initialization and memory-table implementation), whose performance is limited by the address update, the speedup factor decreases to 5.3, although the software implementation is still slower.

Regarding how the N(0,1) CDF^-1 is implemented, in software the developed interpolation-based inversion is only four times faster than direct inversion, while the hardware interpolation presents a speedup factor of almost 15 due to the deep hardware pipeline. Finally, we can focus on variance reduction and the GRNG results with MT as base URNG.

Studied alone, the speedup factors of the Latin Hypercube, identified in the table as (Sb,Db), are not very high. Software can generate the stratas quickly, as each permutation requires few instructions, while the subsequent logic poses no problem for a microprocessor; meanwhile, the clock rate of the FPGA implementation suffers from the size of the tables. However, if the GRNG is studied as a whole, we can observe that the hardware achieves a speedup factor of 17 when no variance reduction technique is employed. When variance reduction is used, the speedup factor reaches up to 24.2, although when the tables slow down the processing the speedup factor decreases to 16.7. All these results highlight the capabilities of FPGAs for hardware acceleration of high-quality random number generation. These speedup factors even make hardware RNGs suitable as accelerators by themselves, and not only as components integrated in a hardware accelerator for a complete Monte Carlo simulation.

2.6. Extending N(0,1) RNG

The N(0,1) RNG is not the only useful generator for Monte Carlo simulation. Using the inversion method and quintic Hermite interpolation, and following the same methodology and architecture as for N(0,1), RNGs for a wide variety of distributions can be implemented. However, for some related distributions, other alternatives can be considered. For example, we have considered the general Gaussian distribution N(µ,σ) and the Log-normal LogN(µ,σ) in addition to N(0,1). With respect to N(µ,σ), it is not feasible to have, for each pair (µ,σ), a specific hardware generator with its own segmentation, search algorithm and coefficients. However, due to the mathematical properties of Gaussian distributions, variables v1 from one distribution N1(µ1,σ1) can be transformed directly into variables v2 from N2(µ2,σ2):

v2 = (v1 − µ1) × σ2/σ1 + µ2    (2.41)

This allows the design of a parameterisable N(µ,σ) generator just by having one base N1(µ1,σ1) and allowing the parameterisation of µ and σ. Furthermore, from Equation 2.41 it is clear that the simplest way to generate the parameterisable N(µ,σ) is to use as base Gaussian RNG one with the standard normal distribution, N(0,1), so the previous transformation is reduced to:

v2 = vn × σ2 + µ2 (2.42)

On the other hand, we have the Log-normal distribution, which is key for some Monte Carlo simulations as it captures random changes in relative magnitude, rather than the absolute magnitude changes provided by the Gaussian distribution. For example, some financial models like the Black-Scholes-Merton model are based on Log-normal distributions [BS73, Mer73].

Figure 2.14: Parameterisable RNG N(µ,σ)-LogN(µ,σ) RNG

                    Slices   Block RAMs   DSP Mult   Speed [MHz]   Throughput [Msample/s]   Stages   Speed Up (x)
  Complete RNG      2736     10           31         220.8         220.8                    78       -
  LogN(µ,σ) RNG     2732     10           31         220.8         220.8                    78       29.0
  N(µ,σ) RNG        2313     9            27         220.8         220.8                    62       18.1
  N(0,1) RNG        1909     9            23         220.8         220.8                    48       17.0

Table 2.11: Parameterisable RNG N(µ,σ)-LogN(µ,σ) RNG results.

As its name implies, a Log-normal distribution is a probability distribution whose logarithm follows a normal distribution. Hence, it can be obtained by combining an exponential function unit with a Gaussian RNG: if X follows a Log-normal distribution, then Y = ln X follows a Gaussian distribution and, inversely, X can be obtained from the Gaussian variable as X = e^Y.
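A software model of both output stages is a one-line transformation each, as sketched below with arbitrary parameter values: Equation 2.42 for N(µ,σ) and its exponentiation for LogN(µ,σ).

    #include <math.h>
    #include <stdio.h>

    /* Scale-and-shift stage (Eq. 2.42) and its exponentiation. */
    static double to_gauss(double z, double mu, double sigma)
    {
        return z * sigma + mu;              /* N(mu, sigma) sample    */
    }
    static double to_lognormal(double z, double mu, double sigma)
    {
        return exp(z * sigma + mu);         /* LogN(mu, sigma) sample */
    }

    int main(void)
    {
        double z = 1.5;                     /* a N(0,1) sample from the base RNG */
        printf("N(2, 0.5):    %f\n", to_gauss(z, 2.0, 0.5));
        printf("LogN(2, 0.5): %f\n", to_lognormal(z, 2.0, 0.5));
        return 0;
    }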

2.6.1. Parameterisable RNG based on N(0,1) RNG

Hence, the inversion-based N(0,1) RNG can be extended with a multiplier, an adder and an exponential function to obtain a parameterisable random generator capable of generating any N(µ,σ) or LogN(µ,σ), while the high statistical quality and the compatibility with variance reduction techniques remain unchanged, as they are determined exclusively by the base uniform RNG. The required architecture is shown in Figure 2.14.

As the base N(0,1) RNG generates single-precision floating-point samples, the required operators must comply with this arithmetic. In particular, the three new operators are the exponential, multiplier and adder/subtracter of the FPGA-oriented library, HW, of Section 3.6.3. In Table 2.11 the implementation results of the parameterisable RNG can be found for the reference FPGA, taking the Mersenne Twister as base URNG. As can be seen, the addition of arithmetic operators does not affect the clock frequency, as the complete generator works at the same clock rate as the base N(0,1) RNG. However, from the point of view of the resources and the depth of the pipeline, the operators have a strong impact.

Focusing on the speedup factor, the introduction of new operators has a performance cost for the software, as the computational load per random variable increases. The adder and the multiplier imply only a small computational overhead (the speedup factor increases to 18.1, with the hardware performance remaining unchanged), as modern processors have dedicated hardware for these floating-point operations. However, the exponential function has to be emulated with a mathematical library, implying a big computational overhead. In this way the speedup factor increases considerably.

2.7. Conclusions

In this chapter, RNGs have been studied from the perspective of their FPGA implementation and of hardware acceleration. We have focused on four topics: uniform and Gaussian random number generation, the inversion method with quintic Hermite interpolation for transforming uniform samples into samples of the target distribution, and variance reduction techniques. A parameterizable Gaussian RNG has been developed, which is the basis for our target application and has been designed to be reusable in many other applications. Our RNG is based on three main components, which are related to the main contributions of our research in this field:

• The first contribution is a high-performance Mersenne Twister uniform RNG. The developed MT uniform RNG outperforms any previous implementation in the literature while implementing all the tasks related to this uniform RNG. Furthermore, we have proposed a new architecture, specifically designed for FPGAs, which takes advantage of characteristic resources of some FPGAs such as the shift registers.
• Our second contribution comprises the methods and techniques developed to implement the spline approximation of an inverse cumulative distribution function on FPGAs with acceleration features. A hardware-oriented segmentation and a specific search algorithm have been studied and designed with the goal of avoiding the intrinsic multicycle search related to classic interpolations with non-uniform, non-hierarchical segmentation. In our case, we have focused on the Gaussian ICDF, to use it as the key component of an inversion-based Gaussian RNG.
• Finally, the last contribution is the research related to variance reduction techniques for FPGAs and hardware acceleration. A hardware core for the Latin Hypercube technique has been studied and implemented.

With these three contributions, we have developed a high-quality, high-performance Gaussian RNG which can be extended with a Latin Hypercube core. The developed generator is parameterizable in two ways: in the distribution obtained, which can be any Gaussian distribution or even any Log-normal distribution with the addition of three floating-point operators, and in the number of stratas and dimensions for the Latin Hypercube. The obtained results show the important speedup factors that can be achieved with hardware RNGs, even for just uniform random number generation, making RNGs ideal candidates for hardware acceleration, integrated in more complex accelerators or by themselves. Meanwhile, our complete generator can be used in our target application or in other Monte Carlo accelerators. The work done in this chapter has been published in several conferences and workshops [ELV07, ELVLB08, ELVP08, ETLVL08]. In particular:

• Pedro Echeverría and Marisa López-Vallejo. FPGA Gaussian random number generator based on quintic Hermite interpolation inversion. 2007.
• Pedro Echeverría, Marisa López-Vallejo, and Carlos López-Barrio. Inversion-based FPGA random number generation using quintic Hermite interpolation. 2008.
• Pedro Echeverría, Marisa López-Vallejo, and José María Pesquero. Variance reduction techniques for Monte Carlo simulations. A parameterizable FPGA approach. 2008.
• Pedro Echeverría, D. B. Thomas, Marisa López-Vallejo, and Wayne Luk. An FPGA run-time parameterisable log-normal random number generator. 2008.

Meanwhile, a short paper is currently under review:

• Pedro Echeverría and Marisa López-Vallejo. High performance FPGA-oriented Mersenne Twister uniform random number generator. 2011.

and a long journal paper is under preparation:

• Pedro Echeverría and Marisa López-Vallejo. FPGA parameterizable Gaussian random number generator for hardware acceleration.

3 Implementing Floating-Point Arithmetic in Configurable Logic

Most scientific and financial simulations rely on floating-point arithmetic due to the huge range and accuracy that can be obtained with this arithmetic and its most used precisions. However, floating-point arithmetic is characterized by a high complexity (see Section 3.2.1) which makes the design and implementation of operators expensive in terms of time and, moreover, in terms of silicon. Due to this, floating-point units in current general-purpose microprocessors are expensive and, therefore, only units for the most common arithmetic operations, multiplication and addition, can be found (Intel Core i7). For all the other mathematical operations and functions there is no dedicated silicon: they are emulated in software using the general operations provided by the floating-point ALU, involving a great number of instructions and, consequently, long execution times.

Thus, the implementation of dedicated hardware units, as in FPGAs, provides a clear advantage in terms of performance [dDDCT08], making FPGAs an ideal platform to accelerate mathematical algorithms. In this context, the availability of complete and fully characterized floating-point operators is essential to implement those applications targeting FPGAs. This is a real challenge due to the complexity of floating-point arithmetic and its associated cost: performance, silicon, deep pipelines, etc.

In this chapter, we study in depth how floating-point operators can be adapted and implemented in FPGAs when the design focus is hardware acceleration. Taking advantage of the flexibility and the resources provided by FPGAs, the floating-point format is adapted. We have chosen a general approach for all operators: a well-known algorithm suited for FPGAs is selected for each operator and, relying on those algorithms, a standard-compliant library is developed. Then, we analyze slight deviations that reduce the complexity of the standard with the goal of increasing the performance of the operators while reducing their resources. These deviations are gradually introduced in the previously developed library, obtaining a set of libraries. Through the comparison of the results obtained for these libraries, we can address one of the main objectives of this chapter: to study the impact of these deviations, both on the improvement of the operators and on their standard compliance and accuracy. In this way, the performance-accuracy-standard compliance tradeoffs are analyzed in depth. Afterwards, a second analysis is carried out: how to recover the accuracy and precision lost with some of the deviations. The objective of this study is to provide a set of recommendations/changes for the FPGA implementation of floating-point operators and, finally, to develop a library of operators following the changes that provide the best results. One last objective of this chapter is the study of FPGA capabilities for implementing datapaths with a huge number of operators, and how the performance of the operators is affected in these cases.

3.1. Related Works

Being key elements for many applications, the implementation of floating-point FPGA operators is a field of increasing research activity, especially if we consider that peak FPGA floating-point performance is growing significantly faster than its CPU counterpart [Und04] and that the set of floating-point operations directly implemented in microprocessors is very reduced. Furthermore, the new FPGA architectures have embedded resources which can simplify the implementation of floating-point operators, as is the case of multipliers.

The research in this field comprises two different perspectives. The first one is focused on the internal architecture of these operators, searching for highly optimized operators that take advantage of the FPGA architecture. Therefore, this perspective comprises the research focused on exploring the algorithms and architectures that are more suitable for FPGAs in general or for specific applications on FPGAs. In Section 3.4 some of these works will be introduced.

The second perspective is focused on how the floating-point standard and format can be adapted to FPGAs with slight simplifications. These simplifications can affect some features of the standard; however, implementations with slight deviations from the standard can be of great interest, since many applications can afford some accuracy reduction [CHL08, ZLHea05] given the important advantages that can be achieved: reduced hardware resources and increased performance.

Table 3.1: Floating-Point Operators Libraries. Four Basic Operators

                          [LB02]    [WBL06]            [GSP05]            [DdD]              [MFHAR10]
  Precision               32, 64    Parameterizable    Parameterizable    Parameterizable    24, 32, 43, 64
  Divisor                 SRT4      Taylor             SRT4 and NR        SRT4               Several
  Square Root             NR        Taylor             NR                 NR                 Several
  Denormalized as zero    Yes       Yes                Yes                Yes                -
  Standard Compliant      No        No                 Yes                No                 -
  Improved features       No        No                 No                 Yes                No

In both fields, most of the research is focused on the four most common operators: adder/subtracter, multiplier, divisor and square root. The adder/subtracter and multiplier algorithms are basically the same for all units in the literature, so the research focuses on other aspects of the units, such as how to handle some features of the standard. Meanwhile, for the divisor and the square root operators the calculation unit can be implemented with several algorithms, see Section 3.4. Regarding advanced operators (exponential, logarithm and exponentiation) few works can be found, standing out [DR04, DdD05a, DdD05b, DdDP07a], with implementations of the exponential and of the logarithm.

Several works have focused on providing libraries for the four basic operators. In Table 3.1, the most outstanding libraries are summarized. As can be seen, all of these libraries present at least one modification to the standard, the replacement of denormalized numbers with zeros, see Section 3.2. The approach we follow in this chapter was also discussed in [GSP05], but we extend it here in several ways:

• We focus on studying the overhead that the floating-point standard and format imply for FPGA operators, not just on providing a library of operators.

• We have included advanced operators.

• A more complete set of deviations is studied.

• We perform an in-depth analysis of the implications of the deviations.

• We study the replicability of the operators.

• We provide a set of recommendations to achieve the resolution and accuracy of the standard with high performance.

Figure 3.1: Floating-Point word.

3.2. Floating Point Format IEEE 754

The IEEE standard [IEE85] is mainly designed for software architectures, usually employing 32-bit words (single precision) or 64-bit words (double precision). Each word (Figure 3.1) is composed of a sign (s), a mantissa (mnt) and an exponent (exp), the value of a number being:

s × mnt′ × 2^exp′ = s × h.mnt × 2^(exp − bias)    (3.1)

where h is an implicit bit known as the hidden bit, the bias is a constant that depends on the bit width of the exponent (wE), its value being 2^(wE − 1) − 1, and the mantissa bit width is wF. Typical values for wE and wF in software implementations are 8 and 23 respectively for single precision, and 11 and 52 respectively for double precision.

The combination of an exponent (the reference value, locating the number within the range) with a mantissa (the significant part of the number, determining its accuracy and exact value) produces an arithmetic that covers a very wide range of numbers in an adaptive way. The difference between two consecutive representable floating-point numbers depends on the exponent value and, therefore, the range is not covered uniformly: there are more numbers around zero and fewer as the range tends to ±∞. Therefore, this scheme ensures that accurate numbers are obtained in any part of the range.

The word format used in floating-point arithmetic allows the representation of four different types of numbers plus exceptions:

• Normalized numbers: the most common numbers.
• Denormalized numbers: numbers very close to zero.
• Zeros: signed zeros → ±0.
• Infinities: with sign → ±∞.
• Not a Number (NaN): to indicate exceptions.

The differentiation among these five types is based on the exponent and mantissa values, while the value of the hidden bit is 1 for normalized numbers and 0 for denormalized ones. Table 3.2 depicts all possible combinations of exponent and mantissa values and their resulting type of number. Finally, although denormalized numbers present a 0 as their exponent value, their values correspond to an exponent value of 1 instead of 0.

Table 3.2: Types of floating-point numbers.

  Type           Exponent          Mantissa   h   Value
  Zero           0                 0          -   ±0
  Denormalized   0                 ≠ 0        0   Eq. 3.1
  Normalized     1 to 2^wE − 2     -          1   Eq. 3.1
  Infinities     2^wE − 1          0          -   ±∞
  NaN            2^wE − 1          ≠ 0        -   -
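For single precision (wE = 8, wF = 23) the classification of Table 3.2 can be reproduced with a few lines of C; the helper below only illustrates the field decoding and is not part of the hardware units.

    #include <math.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    static const char *classify(float f)
    {
        uint32_t w;
        memcpy(&w, &f, sizeof w);              /* reinterpret the 32-bit word */
        uint32_t exp = (w >> 23) & 0xFFu;      /* 8 exponent bits             */
        uint32_t mnt = w & 0x007FFFFFu;        /* 23 mantissa bits            */
        if (exp == 0)    return (mnt == 0) ? "zero" : "denormalized";
        if (exp == 0xFF) return (mnt == 0) ? "infinity" : "NaN";
        return "normalized";                   /* hidden bit h = 1            */
    }

    int main(void)
    {
        printf("%s %s %s\n", classify(1.0f), classify(1e-40f), classify((float)NAN));
        return 0;
    }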

Figure 3.2: Floating-Point Operator.

3.2.1. Format Complexity

The standard is specifically designed to handle the five types of numbers sharing a common format while maximizing the total amount of numbers that can be represented. The combination of these two facts has a high cost in terms of the complexity of the format and, therefore, of the complexity of the floating-point operators. In addition to the calculation unit itself, a preprocessing stage (also known as prenormalization) of the input numbers and a postprocessing stage (also known as postnormalization) of the output numbers are needed, see Figure 3.2.

Therefore, when implementing a floating-point operator, the required hardware is not only devoted to the calculation unit itself: additional logic is needed just to handle the complexity of the format, representing a significant fraction of the area of a floating-point unit. In general, prenormalization logic includes:

• Analysis of the type of number of the inputs. Exponent and mantissa analysis.

• Determination of operation exceptions due to the input type of number or sign.

• Normalization of inputs. Denormalized numbers are converted into normalized ones (leading one detection, mantissa shifting and exponent correction).

• Conversion of inputs into the format required by the calculation unit.

In the same way, postnormalization logic includes:

• Rounding of the result. Mantissa and exponent correction.
• Determination of result exceptions.
• Formatting of the output to fit the type of number of the result. This includes logic for each type of result, such as shifters, adders or multiplexors.

In addition to the complexity due to the five types of numbers sharing a common format, another source of complexity is found in floating-point arithmetic: the rounding of the result of any operation. As in any other arithmetic, the exact result can need more mantissa bits than those provided by the format. In this case, the result needs to be rounded to comply with the format, as its value will lie between two consecutive representable floating-point numbers. However, the IEEE floating-point standard defines four different rounding policies, leading to four different rounding methods and, in this way, adding extra complexity. The rounding methods are:

• Nearest: rounding to the nearest value.
• Up: towards +∞.
• Down: towards −∞.
• Zero: towards 0.

Handling the four rounding methods implies a new degree of complexity and requires more resources.

3.3. Floating-point Units for FPGAs. Adapting the Format and Standard Compliance

When adapting floating-point units to FPGAs, one way of improving their performance is to simplify the complexity of the format so that its associated processing is reduced. Following this idea, several modifications can be introduced in floating-point units, leading to custom formats more suited for FPGA implementation where the complexity of the format is reduced. However, depending on the modifications introduced, the standard compliance of the results obtained can be affected, leading to several trade-offs between performance and standard compliance. In this Thesis, three key design decisions have been thoroughly studied:

1. Simplification of denormalized numbers.
2. Limitation of rounding to just truncation towards 0.
3. Introduction of specific hardware representations for the different types of numbers.

These decisions affect three features of the format that are responsible for most of the pre- and postnormalization overhead: handling denormalized numbers (1), rounding (2) and handling the five types of numbers with a common format (3). Next, the implications of these design decisions are studied.

3.3.1. Simplification of Denormalized Numbers

The use of denormalized numbers is responsible for most of the logic needed during prenormalization and postnormalization. However, most floating-point arithmetic algorithms work with a normalized mantissa (leading bit with a value of 1) in the calculation unit. Thus, denormalized numbers have to be converted to normalized ones during prenormalization. This requires the detection of the leading one of the mantissa, the left shift of the mantissa and the adjustment of the exponent value depending on the position of the leading one. During postnormalization, the result of the calculation unit has to be analyzed to determine the type of number it corresponds to and to make some adjustments to its exponent and mantissa. Again, most of the logic related to these two tasks is required for handling the case of results that are denormalized numbers. Consequently, the use of denormalized numbers requires significant resources, negatively affecting the performance of the arithmetic units. If non-pipelined operators are used, this logic is in the critical path, while if pipelined operators are used, the stages handling denormalized numbers are usually on the slowest paths so extra stages must be introduced. However, denormalized numbers contribute little to a huge number of applications, because these numbers:

• Represent a small part of the format. Most of the floating-point format is reserved for normalized numbers. For example, and considering single precision¹, there are 2^23 × 254 different normalized numbers while only 2^23 denormalized ones.

• Are infrequent: the value of denormalized numbers (< 2^-126) also makes their use very infrequent except for some special applications.

• Compromise the accuracy of results: while normalized numbers have a precision of 24 bits, the precision of denormalized numbers varies from 23 bits to 1 depending on the position of the msb. This compromises the accuracy of the result of any operation where denormalized numbers are involved.

Therefore, one way to simplify the format and to reduce logic is to handle denormalized numbers as zeros, so that all the logic needed to normalize a denormalized input or to format a denormalized result is eliminated. In fact, all commercial FPGA floating-point libraries [Alta, Xilb] follow this scheme, and previous works such as [GSP05, WBL06] also deviate this way from the standard.

¹ Also the base precision for the rest of the chapter.
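As an illustration of this flush-to-zero policy, the following C sketch shows how a single-precision input whose exponent field is zero can be replaced by a signed zero before it reaches the calculation unit. The function name and the software framing are only illustrative; they do not correspond to any particular library.

    #include <stdint.h>
    #include <string.h>
    #include <stdio.h>

    /* Illustrative sketch: a denormalized single-precision input (exponent
       field 0, non-zero mantissa) is replaced by a signed zero, which is the
       simplification discussed in this section. */
    static uint32_t flush_denormal_to_zero(uint32_t bits)
    {
        uint32_t exponent = (bits >> 23) & 0xFFu;
        uint32_t mantissa = bits & 0x7FFFFFu;
        if (exponent == 0 && mantissa != 0)      /* denormalized input */
            return bits & 0x80000000u;           /* keep only the sign */
        return bits;
    }

    int main(void)
    {
        float tiny = 1e-41f;                     /* denormalized in single precision */
        uint32_t bits;
        memcpy(&bits, &tiny, sizeof bits);
        bits = flush_denormal_to_zero(bits);
        memcpy(&tiny, &bits, sizeof tiny);
        printf("%g\n", tiny);                    /* prints 0 */
        return 0;
    }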

3.3.1.1. Deviation from the Standard

The standard cost of this solution is related to resolution and accuracy. First, when denormalized numbers are replaced by zeros we lose resolution around zero, as now 2^23 numbers with values between 2^-149 and 2^-126 − 2^-149 have been removed. Second, accuracy is reduced because we operate with zeros at the inputs, or obtain a zero result, whenever we have a denormalized input or output. However, this loss of accuracy is relative, as the resolution of a denormalized number depends on the position of its leading 1. Due to the lack of resolution or due to a rounding from a previous operation, the maximum relative error of a denormalized number can reach even 100% of its value², being much larger than the maximum relative error for a normalized number, 2^-23.

3.3.2. Truncation Rounding

As stated before in Section 3.2.1, the floating-point standard involves four rounding methods requiring extra logic. In the calculation stage, the result of any operation is generated with more mantissa bits than those required by the floating-point format:

• Guard bits (overflow and underflow): bits necessary for operations where the leading one of the result can be found shifted one position right or left.

• Round bit: the reference bit for deciding the value of the rounding.

• Sticky bit: an additional bit for rounding that groups the value of all the extra remaining bits of the calculated mantissa.

At the postnormalization stage, the rounding methods analyze these extra bits of the result and the sign of the result (for the rounding methods towards +∞ and −∞) to carry out the rounding. The first method, rounding to nearest, provides the most accurate rounding, as it ensures a maximum error of 1/2 ulp (unit in the last place, that is, the least significant mantissa bit, lsb), while for the other three methods the maximum error is up to 1 ulp. Meanwhile, rounding to zero is the method that needs the least logic, because it is equivalent to truncating the result without taking into account the round bit and the sticky bit. This last method is useful for hardware floating-point units, specifically for units that implement iterative algorithms like division or square root. If only rounding towards zero is implemented, these units can be reduced as they do not need to generate the round bit and the sticky bit, consuming fewer cycles and resources and, in the case of pipelined architectures, fewer stages.

² When the leading 1 of the mantissa corresponds to the lsb and the rounding mode is up or down.
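The difference between the two extreme rounding methods can be sketched in a few lines of C. The sketch assumes that the calculation stage delivers the normalized significand already followed by a round bit and a sticky bit (guard-bit handling is omitted); the tie-to-even rule shown for rounding to nearest is the usual choice but is only an assumption of the example.

    #include <stdint.h>
    #include <stdio.h>

    /* Assumed input packing for the example: 24-bit significand, then the
       round bit, then the sticky bit, packed as m<<2 | r<<1 | s. */
    static uint32_t round_nearest_even(uint32_t packed)
    {
        uint32_t m      = packed >> 2;
        uint32_t round  = (packed >> 1) & 1u;
        uint32_t sticky =  packed       & 1u;
        if (round && (sticky || (m & 1u)))   /* ties go to the even significand */
            m += 1;                          /* a carry here may also bump the exponent */
        return m;
    }

    static uint32_t round_toward_zero(uint32_t packed)
    {
        return packed >> 2;                  /* round and sticky bits are simply dropped */
    }

    int main(void)
    {
        uint32_t x = (0xC00000u << 2) | 0x3u;   /* significand 1.5, round = sticky = 1 */
        printf("%06X %06X\n", round_nearest_even(x), round_toward_zero(x));
        return 0;
    }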

3.3.2.1. Deviation from the Standard

Truncation rounding slightly affects the accuracy of a single operation, and only when compared with round to nearest, which ensures a rounding error of up to half an ulp. If we take into account just one operation, this small loss of accuracy can be considered negligible, as we are in the range of a relative error of 2^-23. However, for applications with a large number of chained operators, error propagation must be considered, as the errors introduced in the first operations spread along the chain while the following operators also introduce error in the partial results. In these cases, the error in the final result can be much higher than if round to nearest is used. Finally, another problem can compromise the accuracy of the results obtained if truncation rounding is used: the results are biased, as truncation always rounds in the same direction, diminishing the absolute value of the result for each operation.

3.3.3. Hardware Representation

As explained before, the first task in any operation using a floating-point number is to analyze the values of the exponent and the mantissa of the operands to determine the type of number. Meanwhile, the last task of the operation is to compose the exponent and mantissa of the result taking into account the type of number of the result. These two tasks are necessary to make a software architecture with a fixed word-length (mainly 32 or 64 bits) compatible with the use of the floating-point standard and its different types of numbers. However, in an FPGA architecture the word-length can be flexible and configurable by the designer and can be extended with some flags to indicate the type of number. To represent the five types of numbers, three flag bits would be necessary. However, if denormalized numbers are converted to zeros, two flag bits are enough. This scheme follows the internal use of flags of the FPLibrary [DdD], which we have extended with a systematic approach using interface blocks. Two interfaces are required: first, the input interface, which calculates the value of the flags from a standard floating-point number; second, the output interface, which composes a standard floating-point number from our custom floating-point number and its corresponding flags. The two interfaces carry out some of the pre and postnormalization tasks, so the logic needed in the arithmetic unit can be reduced. With our systematic approach, these interfaces are required every time a piece of input data has standard floating-point format or when a piece of output data needs the standard floating-point format. When a datapath involves chained operations, the advantages of this scheme are clear, as the input interface is only needed for each input number while the output interface applies only to the final result. For all the intermediate operations there is no need for interfaces, so all the operators involved are reduced.
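The following C sketch mimics the role of the two interfaces for the flag-extended representation. The 2-bit flag encoding (00 zero, 01 normal, 10 infinity, 11 NaN), the struct layout and the function names are assumptions made for the example, not the actual encoding of the hardware library.

    #include <stdint.h>
    #include <string.h>
    #include <stdio.h>

    typedef struct { uint8_t flag; uint8_t sign; uint8_t exp; uint32_t mant; } hw_float;

    /* Input interface: derive the type-of-number flag from a standard
       single-precision word (denormalized inputs are treated as zeros). */
    static hw_float input_interface(uint32_t bits)
    {
        hw_float h = { 1, (uint8_t)(bits >> 31), (uint8_t)(bits >> 23), bits & 0x7FFFFFu };
        if (h.exp == 0)         h.flag = 0;               /* zero or denormalized */
        else if (h.exp == 0xFF) h.flag = h.mant ? 3 : 2;  /* NaN or infinity      */
        return h;
    }

    /* Output interface: compose a standard single-precision word again. */
    static uint32_t output_interface(hw_float h)
    {
        switch (h.flag) {
        case 0:  return (uint32_t)h.sign << 31;                          /* zero */
        case 2:  return ((uint32_t)h.sign << 31) | 0x7F800000u;          /* inf  */
        case 3:  return 0x7FC00000u;                                     /* NaN  */
        default: return ((uint32_t)h.sign << 31) | ((uint32_t)h.exp << 23) | h.mant;
        }
    }

    int main(void)
    {
        float x = -3.5f;
        uint32_t in, out;
        memcpy(&in, &x, sizeof in);
        out = output_interface(input_interface(in));
        printf("%08X %08X\n", in, out);   /* round trip: both words are equal */
        return 0;
    }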

3.3.3.1. Deviation from the Standard

This design decision has no standard cost in terms of resolution or accuracy, while the use of interfaces makes the transformation between the standard format and the extended FPGA format transparent.

3.3.4. Global Approach Analysis

The three design decisions just explained address the same point: the simplification of the floating-point format complexity. Handling complexity requires logic resources; thus, as the complexity is reduced, so are the logic resources needed for an operator. Additionally, as fewer resources are used, the operator implementations become more efficient, and speed is also improved or the number of pipeline stages is reduced. Therefore, our approach has focused on determining those features of floating-point arithmetic that require a heavy processing overhead while having minimal relevance in the format. Three design decisions are under study, and while the use of dedicated flags does not have any effect on the standard compliance, the other two design decisions have an impact on the compliance in terms of accuracy and of resolution around zero. Finally, and regarding floating-point standard compliance, one last issue should be addressed when using FPGAs: the non-associativity of floating-point operations. FPGA datapaths are commonly designed taking advantage of FPGA intrinsic parallelism by placing parallel operators in the datapath. However, the results obtained with parallel operations may differ from the ones obtained with an equivalent sequential implementation [KD07]. The parallel architecture implies a different datapath with different partial results that can lead to a different result. Therefore, two issues affect standard compliance: how the floating-point operators are implemented and how the datapath is designed. The second issue depends on the architecture and cannot be analyzed in a general way, being out of the scope of this work. To study the impact of the three design decisions on floating-point operator performance, a series of libraries of operators following those design decisions has been implemented, see Section 3.5. All libraries are single precision arithmetic, as double precision requires at least twice the resources of single precision, while requiring more pipeline stages and achieving lower frequencies [Alta]. Although double precision presents higher accuracy, the accuracy required in the results of many applications can be achieved with single precision or a tailored precision which is much closer to single than to double precision [LLK+06]. Furthermore, the conclusions of the study we perform here for single precision can be easily generalized to double precision. The developed libraries are composed of six operators: the four basic mathematical operators (addition/subtraction, multiplication, division and square root), which traditionally composed previous floating-point libraries, and two more complex operators, the exponential and the logarithm.
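The non-associativity mentioned above is easy to reproduce in software. The following minimal C example sums the same three single-precision values in two different orders and obtains two different results, which is exactly why a parallel (tree-shaped) datapath may not match a sequential one bit for bit; the values chosen are arbitrary.

    #include <stdio.h>

    int main(void)
    {
        float a = 1.0e8f, b = -1.0e8f, c = 1.0f;
        float left  = (a + b) + c;   /* 1.0: a and b cancel first          */
        float right = a + (b + c);   /* 0.0: c is absorbed when added to b */
        printf("left=%g right=%g\n", left, right);
        return 0;
    }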

Table 3.3: Logic Reduction due to Design Simplifications.

                           Prenormalization                            Postnormalization
Denormalized Numbers       Leading one detection (∗, /, √, ln, x^y)    Mantissa shifting (+, ∗, /, e^x, x^y)
Simplification             Mantissa left shifting (∗, /, √, ln, x^y)   Exponent correction (+, ∗, /, e^x, x^y)
                           Exponent correction (+, ∗, /, √, ln, x^y)
Truncation Rounding        -                                           Rounding bits evaluation (All)
                                                                       Rounding (All)
Hardware Flags             Mantissa analysis (All)                     Format exponent (All)
                           Exponent analysis (All)                     Format mantissa (All)

For each operator, we have gradually applied the above mentioned design simplifications, from standard operators to operators including the three simplifications. These simplifications imply major changes in the pre and postnormalization logic of each operator, reducing the resources needed. Table 3.3 summarizes the tasks that are eliminated for each design decision and each operator. The calculation stage of the operators remains almost unchanged: only the logic needed for computing the extra bits required for rounding is removed when truncation rounding is introduced. In the following, we briefly analyze how we have implemented each operator.

3.4. Operators Architecture

The goal of this work has not been to develop new algorithms for floating-point operators but to study how the format complexity can be better handled when the target hardware is an FPGA. Therefore, known algorithms that are suited for FPGAs have been selected. In this section, the architecture of the implemented floating-point operators is summarized focusing on the impact of the complexity simplifications on their architectures.

3.4.1. Adder/subtracter

The calculation stage of a floating-point adder/subtracter unit does not present any particular complexity. For calculating the result's mantissa, only a mantissa shifter to align both mantissas, an OR reduction to compute a sticky bit from the shifted-out bits, and a fixed-point adder and subtracter (depending on the sign of the inputs and on which one is bigger) are needed. Meanwhile, the exponent of the result is considered equal to the exponent of the biggest operand, and the sign depends on the input signs and on which operand is bigger. With respect to prenormalization and postnormalization, the adder/subtracter can be considered an exception with respect to the rest of the operators, as most of the logic needed to handle denormalized numbers is subsumed in the logic needed to align the mantissas or in the analysis of the result.

Figure 3.3: Adder-Subtracter.

As can be seen in Figure 3.3, during prenormalization, in addition to analyzing the type of the numbers at the inputs, the bigger operand is determined by comparing exponents and mantissas, swapping them in case the biggest operand is op_b (the calculation stage expects op_a to be the biggest operand). The difference between both input exponents is calculated to align both operands. In postnormalization, firstly, the position of the leading one of the obtained mantissa is detected, as a carry can be obtained if an addition is done or a cancellation of the most significant bits can happen in a subtraction. Afterwards, the mantissa is shifted and rounded while the exponent is corrected depending on the position of the leading one and on whether there has been a carry in the rounding. Finally, the result is formatted. A particular case happens in the adder-subtracter when the flags for the type of number are used. A zero operand generated in another operation unit of the library can have an exponent and a mantissa with values different from zero, as the zero value will only be indicated by the flags. As a zero input does not generate an exception in this unit, the value of the exponent and the mantissa must be set to zero to obtain the correct result.
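The alignment of the smaller operand and the associated sticky-bit computation can be sketched as follows in C; the 24-bit significand width and the function signature are assumptions for the example.

    #include <stdint.h>
    #include <stdio.h>

    /* Shift the significand of the smaller operand right by the exponent
       difference and OR-reduce the shifted-out bits into a sticky bit. */
    static uint32_t align_significand(uint32_t mant, uint32_t exp_diff, uint32_t *sticky)
    {
        if (exp_diff >= 32) { *sticky = (mant != 0); return 0; }
        uint32_t lost = mant & ((1u << exp_diff) - 1u);
        *sticky = (lost != 0);
        return mant >> exp_diff;
    }

    int main(void)
    {
        uint32_t sticky;
        uint32_t aligned = align_significand(0xC00001u, 5, &sticky);
        printf("aligned=%06X sticky=%u\n", aligned, sticky);   /* 060000, 1 */
        return 0;
    }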

3.4.2. Multiplication

Nowadays FPGAs include embedded multipliers that can be directly used to multiply the input mantissas. Since current embedded multipliers (18x18 multipliers for Virtex 4 and Stratix III and IV, 25x18 multipliers for Virtex 5 and 6) have at least one of their two operand inputs with a bit width smaller than the 24-bit mantissas required for single floating-point precision, our architecture takes advantage of the distributive property of multiplication:

(a + b) × (c + d) = a × c + a × d + b × c + b × d

and splits the input mantissas into two parts, one corresponding to the most significant bits (upper parts, xu and yu) and the other corresponding to the least significant bits (lower parts, xl and yl), multiplying these subparts in parallel. Afterwards, the results of the partial multiplications are added taking into account the necessary alignment between operands³. The sign of the result is calculated with an XOR gate while the exponent is obtained by adding the input exponents⁴. As in the adder-subtracter, a carry may be obtained in the mantissa result, so an additional exponent adjustment is needed in postnormalization. Apart from analyzing the input types of numbers, the prenormalization tasks are related to converting denormalized inputs into normalized format. The calculation stage expects input mantissas with a 1 as their most significant bit, requiring leading one detection, mantissa left shifting and exponent correction in case there is a denormalized input. As the multiplication is carried out with two mantissas with a one in their most significant bit, the mantissa obtained will have its leading one in bit wF·2+1 (the bit corresponding to the calculated exponent) or in bit wF·2+2⁵. When denormalized numbers are handled, prior to the rounding, the obtained mantissa has to be right shifted to align denormalized numbers. This shifting is also used for normalized numbers to shift right one position in case the leading one is in bit wF·2+2. Afterwards the mantissa is rounded. As a carry can be obtained, a final exponent and mantissa correction is needed. When only normalized numbers are used, rounding can be done directly, only taking into account where the leading one of the mantissa is. In Figure 3.4(a), an architecture including denormalized numbers is shown, while in Figure 3.4(b) the architecture has been simplified for handling denormalized numbers as zeros.
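A software analogue of this split is shown below. The 12/12 split of the 24-bit significands is chosen only to keep the example simple; the actual partitioning follows the width of the embedded multipliers of the target device.

    #include <stdint.h>
    #include <stdio.h>

    /* Multiply two 24-bit significands by splitting each one into an upper
       and a lower part and adding the four partial products with the proper
       alignment, as in (a + b) x (c + d) = ac + ad + bc + bd. */
    static uint64_t split_multiply(uint32_t x, uint32_t y)
    {
        uint32_t xu = x >> 12, xl = x & 0xFFFu;
        uint32_t yu = y >> 12, yl = y & 0xFFFu;
        return ((uint64_t)xu * yu << 24)
             + ((uint64_t)xu * yl << 12)
             + ((uint64_t)xl * yu << 12)
             +  (uint64_t)xl * yl;
    }

    int main(void)
    {
        uint32_t x = 0xC00000u, y = 0xA00000u;          /* significands of 1.5 and 1.25 */
        printf("%llX %llX\n",
               (unsigned long long)split_multiply(x, y),
               (unsigned long long)((uint64_t)x * y));  /* both prints match */
        return 0;
    }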

3.4.3. Division

The architecture of the divider is quite similar to the multiplier architecture, as division is the inverse function of multiplication (the exponent of the result is now calculated by subtraction). However, since there are no embedded dividers in current FPGAs, the mantissa division has to be implemented with logic. Binary division can be implemented using several algorithms [OF97], like digit recurrence methods such as restoring, non-restoring or SRT algorithms [Fre61], or algorithms based on Taylor series [HFMF99].

³ For Virtex 5 and 6, this scheme corresponds to the equation a × (b + c), where only one input mantissa needs to be split.
⁴ We will not go into the details of the handling of the exponent bias. The same applies to the rest of the units.
⁵ Considering the least significant bit as bit 0.

Figure 3.4: Multiplier Architecture. (a) Handling Denormalized Numbers. (b) Only Normalized Numbers.

Among these methods, the most common implementations are the digit recurrence methods, where one bit (restoring, non-restoring and radix-2 SRT algorithms) or more bits (radix-4 SRT, radix-8 SRT, etc.) of the mantissa result are calculated per division step. We have selected the non-restoring algorithm for our libraries due to its better performance when compared to other digit recurrence methods, as it has the simplest division step, and due to its logic requirements when compared to Taylor series methods, as these methods require multiplications and look-up tables [WBL06]. This choice implies a division with the highest number of division steps needed to obtain the result, and therefore more clock cycles than with SRT algorithms with a radix bigger than two [HU07]. However, it presents the best performance per division step, which makes this algorithm the most suited for high clock rates. In Figure 3.5 the architecture of the non-restoring division step is depicted. The sign q_i of the partial result PR_i from the previous step, inverted, gives bit i of the mantissa result (the inverter gate is not shown in the figure). If PR_i was positive, the operation to carry out is PR_{i+1} = PR_i × 2 − divisor, while if it was negative it is PR_{i+1} = PR_i × 2 + divisor.

Both operations are carried out in the adder, as the divisor, in case PR_i was positive, is converted into a negative number bit by bit with XOR gates, while the plus 1 is introduced in the lsb of the other operand. Inversely to the multiplication overflow, an underflow is now generated when the dividend mantissa is smaller than the divisor mantissa. In this case, the obtained exponent has to be decreased by one, while two options can be considered to obtain the correct mantissa. The first one is to divide the dividend mantissa directly, so the mantissa result requires a left shift of one position in case of underflow. The second one is to handle the underflow in the prenormalization (shifting left the dividend mantissa when necessary) or in the first iteration of the binary division. These two last options obtain the mantissa result directly with a leading one in its most significant bit and make the use of a guard bit unnecessary.

Figure 3.5: Division step.

In our case we have followed this solution making a dual first iteration step, computing in parallel Dividend − Divisor and (Dividend × 2) − Divisor. For the cases where the dividend mantissa is bigger than or equal to the divisor mantissa we continue the division with the result of the first subtraction, and with the result of the second subtraction when the divisor mantissa is bigger. This way we can ensure that there is no underflow. With respect to prenormalization and postnormalization, these tasks are the same as in the multiplier with two differences: the previously mentioned underflow instead of a carry, and the fact that no carry can be obtained in the rounding (consequently, there is no need for exponent correction besides subtracting one in case of underflow).
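The following C sketch follows the non-restoring recurrence and the dual first step just described, producing the truncated quotient significand. The 24-bit operand width and the function interface are assumptions of the example; the hardware computes the same bits with one conditional add/subtract per iteration step.

    #include <stdint.h>
    #include <stdio.h>

    /* Inputs: normalized 24-bit significands (integers in [2^23, 2^24)).
       Output: truncated 24-bit quotient significand; *exp_adjust is -1 when
       the dividend is smaller than the divisor. */
    static uint32_t nr_divide(uint32_t a, uint32_t b, int *exp_adjust)
    {
        int64_t d0 = (int64_t)a - b;          /* Dividend - Divisor      */
        int64_t d1 = 2 * (int64_t)a - b;      /* (Dividend*2) - Divisor  */
        int64_t pr = (d0 >= 0) ? d0 : d1;     /* dual first iteration    */
        *exp_adjust = (d0 >= 0) ? 0 : -1;

        uint32_t q = 1;                       /* leading one of the result  */
        for (int i = 0; i < 23; i++) {        /* 23 truncated fraction bits */
            pr = (pr >= 0) ? 2 * pr - b : 2 * pr + b;
            q  = (q << 1) | (pr >= 0 ? 1u : 0u);
        }
        return q;
    }

    int main(void)
    {
        int adj;
        uint32_t q = nr_divide(0xC00000u, 0xA00000u, &adj);   /* 1.5 / 1.25 */
        printf("q=%06X adj=%d (%.7f)\n", q, adj, (double)q / (1 << 23));
        return 0;
    }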

3.4.4. Square Root

The square root calculation is based on one of its mathematical properties:

if x = a × b  =>  √x = √(a × b) = √a × √b

This property perfectly fits the values of floating-point numbers (x = s × m × 2^e). For positive values (the only valid inputs for the square root) the square root of a floating-point number can be calculated as:

√x = √m × √(2^e)    (3.2)

where √(2^e) is very easy to calculate. When e is even, the result is 2^(e/2), as 2^e = 2^(e/2) × 2^(e/2). For an odd e, e needs to be converted into an even value by subtracting one (and multiplying m by two):

• Even e: √x = 2^(e/2) × √m

• Odd e: √x = 2^((e−1)/2) × √(2 × m)

The calculation of √m is also simplified due to the floating-point nature of the operand. For normalized numbers, m will have a value in the range [1, 4) (due to the multiplication by two for odd e), so the output is limited to [1, 2).
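The even/odd exponent handling can be expressed compactly in C using double arithmetic; the sqrt() call stands in for the digit-recurrence core of the hardware unit, so this is only an illustration of the exponent/significand split.

    #include <math.h>
    #include <stdio.h>

    /* Square root through the decomposition x = m * 2^e with m in [1, 2):
       fold one factor of 2 into m when e is odd so that e/2 is exact. */
    static double fp_sqrt_split(double x)
    {
        int e;
        double m = 2.0 * frexp(x, &e);       /* x = m * 2^(e-1), m in [1, 2) */
        e -= 1;
        if (e & 1) { m *= 2.0; e -= 1; }     /* odd exponent: m now in [2, 4) */
        return ldexp(sqrt(m), e / 2);        /* sqrt(m) in [1, 2)             */
    }

    int main(void)
    {
        printf("%.17g %.17g\n", fp_sqrt_split(2.0), sqrt(2.0));
        printf("%.17g %.17g\n", fp_sqrt_split(0.03), sqrt(0.03));
        return 0;
    }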

Figure 3.6: Square Root.

When the input number is denormalized, the calculation method is the same: it is only necessary to transform the denormalized mantissa into a normalized one by multiplying it by a power of two and subtracting the corresponding exponent from e. Then the previous method can be applied directly. The square root of m can be computed very similarly to the division when a digit recurrence algorithm is implemented, and thus a non-redundant algorithm has been selected again. The calculation unit follows the non-restoring algorithm presented in [LC97]. This algorithm is especially well suited for FPGAs as it works with reduced bit-width operands. As in the divider, each step is composed of a conditional negation and an addition, but now the bit-width of each step (and of those operations) is determined by the bit-width of the partial result calculated before that step. Regarding pre and postnormalization, prenormalization includes the normalization of the input number and its analysis, while postnormalization is simplified with respect to other operators: no denormalized number can be obtained as the result of a square root and, additionally, rounding cannot produce a carry, so no exponent correction is needed since the mantissa is directly aligned after rounding. The architecture of the square root is shown in Figure 3.6.

3.4.5. Exponential and Logarithm Units

Our exponential and logarithm operators are based on the previous work of Detrey and de Dinechin [DdD05a][DdD05b][DdD07]. The technique used in both works is to reduce the input range and then use table-driven methods to calculate the function in the reduced range [Tan89, Tan90]. Although both operators were designed to suit FPGA flexibility (using internal tailored fixed-point arithmetic and exploiting the parallelism features of the FPGA), their automatic generation presents some inefficiencies. We have improved the original implementations, redesigning them and including the following features:

• Redesign of units to deal only with single precision. The feature of bit-width configurability of the base designs has been removed with the corresponding saving of resources.

• Simplification of constant multiplications. Conventional multipliers have been removed where the multiplications involved constant coefficients, being replaced by KCM multipliers [BDM08]. As KCMs are FPGA-oriented constant multipliers, the performance is improved while logic resources are reduced.

• Use of unsigned fixed-point arithmetic. In [DdD05a, DdD05b] internal signed fixed-point arithmetic is used. However, some operations (like the ones involving range reduction and the calculation of the exponent of the result in e^x) are consecutive and related, and the sign of the result can be inferred from the input sign. For such operations signed arithmetic has been replaced by unsigned arithmetic, with the corresponding reduction in logic.

• Improved pipelining. The speed is enhanced by systematically introducing pipeline stages to the datapath of the exponential and logarithm units and their subunits.

3.4.5.1. Exponential unit

The algorithm [DdD05a] for e^x is as follows. Using the following expression for x, the calculation of the exponential function of a floating-point number is greatly simplified:

x ≈ k ln 2 + y  →  e^x ≈ 2^k e^y    (3.3)

where y is the number that verifies the transformation. This way, the computation of the exponent and of the mantissa (e^y) of the result is split. On the one hand, the exponent of the result is calculated approximately, with a deviation of ±1 (k in Figure 3.7), while an input range reduction technique is applied to obtain a smaller number range for the mantissa computation, which is an exponential computation, e^y. The exponential unit works internally with a fixed-point number, x_fixed, obtained from the floating-point x in an input shifter, as can be seen in Figure 3.7 (to fixed box). k is calculated by multiplying this fixed-point number by 1/ln 2 and rounding the result, while y is obtained by subtracting k × ln 2 from x_fixed. The calculation of e^y generates the mantissa of the result, and also involves a second range reduction, as y can be split by using the exponential property e^(a+b) = e^a e^b. In this way, e^y = e^y1 e^y2 is split, with y1 corresponding to the most significant bits of y and y2 to the least significant bits of y. The calculation of both exponentials is based on table-driven methods: while e^y1 is calculated directly, e^y2 is calculated indirectly using the Taylor formula, reconstructing it from T(y2):

T(y2) ≈ e^y2 − 1 − y2  →  e^y2 ≈ 1 + y2 + T(y2),  e^y = e^y1 e^y2 = e^y1 + e^y1 (y2 + T(y2))    (3.4)
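The first range reduction of Equation (3.3) translates directly into software. In the sketch below doubles stand in for the internal fixed-point datapath, and the library exp() call replaces the table-driven evaluation of e^y of Equation (3.4); the constants and names are chosen only for the example.

    #include <math.h>
    #include <stdio.h>

    /* e^x = 2^k * e^y with x = k*ln2 + y and |y| <= ln(2)/2. */
    static double exp_range_reduction(double x)
    {
        const double ln2 = 0.6931471805599453;
        double k = rint(x / ln2);            /* k is the (approximate) result exponent */
        double y = x - k * ln2;              /* reduced argument                       */
        return ldexp(exp(y), (int)k);        /* 2^k * e^y                              */
    }

    int main(void)
    {
        printf("%.17g %.17g\n", exp_range_reduction(3.7), exp(3.7));
        return 0;
    }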

During prenormalization, denormalized numbers are handled as zeros by all the libraries, as the exponential of any denormalized number gives the same result as e^0, 1. However, during postnormalization, denormalized numbers can be obtained and must be handled (right shifted). Finally, the exponent has to be adjusted depending on the value of the mantissa to eliminate the possible ±1 deviation.

Figure 3.7: Exponential function unit.

Figure 3.8: Logarithm function unit.

3.4.5.2. Logarithm Unit

The algorithm used for ln x [DdD05b] is shown in Figure 3.8. Due to the logarithm properties ln(a·b) = ln(a) + ln(b) and ln(a^b) = b·ln(a), the logarithm of a normalized number can be computed as:

y = ln x → y = ln(1.mnt) + (exp − bias) × ln 2 (3.5)

During the direct computation of this formula a catastrophic cancellation of the two terms may happen. Consequently, the internal architecture of the unit would need very large bit widths for the internal variables and very large operators to maintain the error bounds. To avoid this problem, the algorithm centers the output range of the function around zero:

• y = ln(1.mnt) + (exp − bias) × ln 2, when 1.mnt ∈ [1, √2)

• y = ln(1.mnt/2) + (1 + exp − bias) × ln 2, when 1.mnt ∈ [√2, 2)

where the first element can be denoted as ln M and the second element (before it is multiplied by ln 2) as E, see Figure 3.8. In this way, the input range of ln M is reduced to M ∈ [√2/2, √2), with an output range of [−ln 2/2, ln 2/2). The calculation of ln M is done through polynomial methods and, to achieve less than one ulp of error at a smaller cost, ln M is not calculated in one step but in two. When M is close to 1, ln M is close to M − 1, so instead of computing ln M, fewer resources are needed if f(M) = ln M / (M − 1) is calculated and ln M is then reconstructed by multiplying by M − 1. Once reconstructed, ln M is added to E × ln 2, obtaining the result in a fixed-point format which is transformed into a floating-point number in the postnormalization unit. This unit is composed of a leading one detector and a shifter to obtain the mantissa of the result. The exponent of the result will correspond to the exponent of the leading one of the fixed-point number. As the logarithm function is the inverse of the exponential one, denormalized numbers now only need to be handled at prenormalization, as no denormalized number can be obtained as a result.
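The decomposition and the centring of the output range can be mimicked in C as follows. Doubles again replace the internal fixed-point datapath and log() replaces the two-step polynomial evaluation of ln M, so this is only a sketch of the range reduction itself.

    #include <math.h>
    #include <stdio.h>

    /* ln x = ln M + E*ln2 with M in [sqrt(2)/2, sqrt(2)), which keeps ln M
       in [-ln2/2, ln2/2) and avoids the catastrophic cancellation. */
    static double log_range_reduction(double x)
    {
        const double ln2 = 0.6931471805599453;
        int e;
        double m = 2.0 * frexp(x, &e);       /* x = m * 2^(e-1), m in [1, 2) */
        e -= 1;
        double M = (m < sqrt(2.0)) ? m : m * 0.5;
        int    E = (m < sqrt(2.0)) ? e : e + 1;
        return log(M) + E * ln2;
    }

    int main(void)
    {
        printf("%.17g %.17g\n", log_range_reduction(123.456), log(123.456));
        return 0;
    }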

3.5. Libraries Evaluation and Comparison

Four libraries have been developed to study and compare the impact of each design decision (Section 3.3), where the design simplifications have been introduced gradually. The resulting libraries are:

• Std: operators without any significant change with respect to the floating-point standard (supporting the four rounding methods and denormalized numbers). Exception handling and NaN signaling are not provided.

• Norm: operators handling denormalized numbers as zeros.

• 0_Rnd: operators with only the truncation rounding mode and also handling denormalized numbers as zeros.

• HP (High Performance): operators designed including all three design decisions: denormalized numbers as zeros, truncation rounding and use of flags for the type of number.

As previously mentioned, high performance floating-point operators require deeply pipelined implementations, where the optimal number of stages for each operator depends on which design features or properties are prioritized. The criterion followed to determine the number of pipeline stages of each operator has been to achieve a high clock frequency with a reasonable number of stages. Thus, we have determined which basic calculation or iteration in any of the operators of the HP library is in the critical path, its delay being the maximum delay allowed for all other operators of HP. In our case, the critical path is found in the multiplier operator, specifically in the embedded DSPs. As we cannot pipeline it, the speed of this path has been set as the minimum one for all operators.

Table 3.4: Operators Results. Commercial Library.

             Xilinx [Xilb]      Norm               NormXlx
     Stages   Slices   MHz      Slices   MHz       Slices   MHz
+/-     8      418    256.1      402    252.9       395    257.9
*       6      150    246.9      148    253.5       144    255.0
/      26      668    175.7      802    276.9       777    276.9
/      27      868    290.9      821    293.7       781    296.8
√      18      418    190.7      411    250.2       389    250.6

For the Std and Norm libraries the same design criterion has been followed. Meanwhile, for 0_Rnd we have chosen the same number of pipeline stages as for the operators of the HP library, to study the benefits of the use of internal flags without any other changes. The evaluation of the libraries has been carried out on a Xilinx Virtex-4 XC4VFX140-11 FPGA with the ISE 10.1 environment. Results obtained are post place & route, using the same settings (balanced mapping, and normal place & route with high effort). The evaluation has been carried out in three main parts.

1. Comparison of operators with a commercial library to verify the quality of the designed libraries.

2. Study and comparison of the results for each component and for each of the libraries, analyzing the impact of each design decision on the pre and postnormalization logic.

3. Analysis of the capabilities of current FPGAs to implement floating-point operators.

3.5.1. Comparison with respect to a Commercial Library

We have selected the Xilinx floating-point operators (logic core Floating-Point Operator v4.0 [Xilb]) as the reference commercial library. This library is parameterizable (variable exponent and mantissa bit-widths and number of pipeline stages) and presents two deviations from the floating-point standard: denormalized numbers are not supported and rounding is restricted to round to nearest. Consequently, the Xilinx operators are almost equivalent to the ones of our Norm library, only differing in the calculation of the rounding, as Xilinx does not support the four rounding modes of the floating-point standard, see Section 3.3.2. Therefore, to make an exact comparison we have tuned our Norm operators restricting rounding to round to nearest, called NormXlx operators in the following. Finally, we have configured the Xilinx and NormXlx operators with the same number of stages, also the same as the ones of Norm, except in the divider, where an additional stage has also been added to meet the Xilinx optimum, 27 stages. Some common features for the four basic operators can be observed in Table 3.4.

Table 3.5: Operators Results. All Libraries.

                       Std                Norm               0_Rnd              HP
     DSP  BRAM     Slc  Stg   MHz     Slc  Stg   MHz     Slc  Stg   MHz     Slc  Stg   MHz
+/-   -    -       414    9  251.6    402    8  252.9    369    6  267.5    344    6  286.4
*     4    -       457   10  250.9    148    6  253.5    109    5  242.7    102    5  250.0
/     -    -      1037   29  256.6    802   26  276.9    753   24  274.3    717   24  287.9
√     -    -       515   20  250.0    411   18  250.2    342   16  250.4    328   16  256.8
e^x   4    1       590   18  250.6    482   16  244.4    463   15  258.6    449   15  253.9
ln    5    2       878   18  234.0    777   16  250.6    737   14  250.3    732   14  250.3

Our operators achieve better frequencies, with a remarkable increase in both the divider (for our optimum number of stages) and the square root. In the first case, the divider, the 100 MHz increase is due to the design improvement of a dual first division step, which makes the calculation of the guard bit unnecessary. As a consequence, in this comparison the Xilinx operator is configured with one stage less than its optimum. However, if the dividers are configured with the Xilinx optimum number of stages, 27, our operator is still faster. In the second case, the square root, there is a 60 MHz increase due to the algorithm selected or to how it is implemented, as the Xilinx operator stays below 200 MHz until it is configured with 26 stages. With respect to the resources used, all our Norm operators are slightly smaller even though they implement the four rounding methods (except for the division with 26 stages, which is not comparable as ours is 101 MHz faster). When comparing the Xilinx operators with the NormXlx ones, the improvements increase, as we are removing logic from our operators. The biggest impacts can be observed in the divider and in the square root, due to the removal of part of the sticky bit calculation logic. This is possible in both operators because the calculation of the sticky bit differs between the round-to-nearest and the round-up/down methods, requiring independent logic, as the sticky bit is calculated in parallel during the last iteration step to avoid an extra stage. No comparison is possible for the two advanced operators, the exponential and the logarithm, as this commercial library, like almost all other libraries, only provides the four basic operators.

3.5.2. Operators Evaluation

Table 3.5 and Table 3.6 show the experimental results (clock frequency, number of pipeline stages, and number and type of logic resources) obtained for the implemented operators, while Figure 3.9 depicts them graphically. Regarding all figures of merit, the final library, HP, outperforms the standard library Std, each operator being faster while requiring fewer resources and fewer pipeline stages. As each design decision is introduced, each library outperforms or equals the previous one, except for the clock frequency. For each design decision we are removing logic (completely or partially), and therefore the clock frequency increases if the pipelined architectures are not changed, as is our case.

Table 3.6: Slices type of logic per operator.

          Std           Norm          0_Rnd         HP
        LUT    FF     LUT    FF     LUT    FF     LUT    FF
+/-     587   438     535   421     521   356     481   362
*       663   337     203   153     127   147     114   153
/      1301  1409     909  1317     806  1259     733  1257
√       651   661     477   610     418   537     375   527
e^x     936   594     783   501     795   477     746   478
ln     1415   947    1255   878    1206   799    1189   805

Figure 3.9: Operators Evaluation. (a) Clock Frequency. (b) Pipeline Stages. (c) Logic Resources.

However, there are several exceptions to this general trend, mainly in operators using DSPs. We have found that this is due to the P&R algorithms, which do not ensure that the optimum implementation is achieved. The importance of all the improvements can be analyzed together by comparing the HP operators to Std. The reduction of slices is between 77.7% (multiplier) and 16.6% (logarithm), while the pipeline stages are reduced by between 5 (multiplier and divider) and 3 (adder and logarithm) stages. We have analyzed each stage separately: prenormalization (pre), operator calculation (cal) and postnormalization (pst), to study in detail the impact of the simplifications introduced on the operators and their impact on the pre and postnormalization overheads. In Tables 3.7 and 3.8 the results obtained in this way are shown, focusing on two metrics: the number of pipeline stages required and the number of resources needed. To have a global perspective, we can imagine a synthetic datapath composed of the six operators. The results obtained for that datapath would be equivalent to the ones in Figure 3.10, where we have grouped the results from both tables per library of operators.

Table 3.7: Split Slices comparison.

          Std                     Norm                    0_Rnd                   HP
      Unit Pre. Cal. Pst.     Unit Pre. Cal. Pst.     Unit Pre. Cal. Pst.     Unit Pre. Cal. Pst.
+/-    414  146  128  198      402  146  128  192      369  146  124  155      344  138  124  134
*      457  262   75  179      148   43   75  113      109   43   63   51      102    4   63   50
/     1037  252  729  127      802   47  729   64      753   47  695   46      717    4  695   12
√      515  150  355   49      411   39  355   39      342   39  311   16      328   29  311    0
e^x    590  124  301  186      482  124  301  105      463  124  301   74      449  119  301   38
ln     878  130  581  231      777   25  581  231      737   25  581  178      732    3  581  173

Table 3.8: Split Pipeline Stages comparison.

          Std                     Norm                    0_Rnd                   HP
      Unit Pre. Cal. Pst.     Unit Pre. Cal. Pst.     Unit Pre. Cal. Pst.     Unit Pre. Cal. Pst.
+/-      9    2    2    5        8    2    2    4        6    2    2    2        6    2    2    2
*       10    2    4    4        6    1    4    3        5    1    4    1        5    1    4    1
/       29    2   25    3       26    1   25    2       24    1   24    1       24    1   24    1
√       20    2   17    1       18    1   17    1       16    1   16    1       16    1   16    0
e^x     18    1   13    4       16    1   13    3       15    1   13    2       15    1   13    2
ln      18    2   12    4       16    1   12    4       14    1   12    2       14    1   12    2

In Figure 3.10(b), the keys Pre-Cal and Cal-Pst correspond to the pipeline stages shared between the prenormalization and calculation stages and between the calculation and postnormalization stages, respectively.

3.5.2.1. Denormalized

Comparing Std with Norm operators we can observe a reduction of the number of slices required from 67.6% (multiplier) to 11.5% (logarithm). The adder can be considered a special case as almost all the extra logic in Std is subsumed in the logic of Norm.

Figure 3.10: Slices and Stages per type for each library. (a) Slices. (b) Stages.

The overhead due to the handling of denormalized numbers can mainly be found in the prenormalization stage. As seen in Table 3.3, the prenormalization stage of almost every operator has several tasks only related to denormalized numbers. These numbers need to be converted to the normalized number format, with a one in the first bit, requiring a leading one detection and a mantissa left shift, while their exponent values have to be corrected. These tasks have a major impact on the resources required, Table 3.7, mainly in the multiplier and the divider, as in both operators the resources for these tasks have to be replicated for their two inputs. On the other hand, the impact on the postnormalization stage is much smaller because there is only one output to handle and part of the logic needed for handling denormalized numbers is shared with the logic required for handling the result. With respect to pipeline stages, not handling denormalized numbers reduces the stages needed in two ways:

• Elimination of stages: as some tasks with dedicated logic are removed from the datapath.

• Overlap of stages: when denormalized numbers are no longer handled, the calculation stage of some operators can work directly with the input numbers. In parallel, the prenormalization stage analyzes the type of the input numbers and processes the exceptions due to non-normalized inputs.

3.5.2.2. Rounding

Introducing the limitation of the rounding methods to just rounding towards zero, 0_Rnd, implies an additional reduction of slices and stages. Rounding mainly affects the postnormalization stage, as resources are required to evaluate the rounding bits, to carry out the rounding (which implies two adders, for the exponent and the mantissa) and for the multiplexing logic needed when a carry can be obtained after rounding the mantissa. Regarding the calculation stage, the major impact is on the operators with digit recurrence methods, as extra iterations are needed for computing the rounding bits and our architectures use extra resources to calculate the sticky bit in parallel with the last iteration step.

3.5.2.3. Types of Numbers

Again, as happens when not handling denormalized numbers, the major impact of introducing hardware flags for the type of number can be found in the operators with two inputs. The logic needed to determine the type of number by analyzing the mantissa and exponent values (comparators) needs to be replicated. Meanwhile, the result is formatted in the postnormalization by using a multiplexer.

Figure 3.11: Operators Replicability (HP Library).

Table 3.9: FPGA Resources.

            Virtex 4            Virtex 5
          SX55    FX140       FX200    SX240
Slices    24576   63168       30720    37440
BRAM        320     552         384     1056
DSP         512     192         456      516

When this logic is removed, the resources saved are equivalent to the resources needed for the interface units. Therefore, the reduction of logic resources with respect to 0_Rnd (up to 6.8% in the multiplier) is only effective when implementing chained operators.

3.5.3. Replicability

Replicability is a key issue to address when analyzing the suitability and capacity of modern FPGAs to implement floating-point applications. We can define replicability in a simple way as the number of operators that can be implemented in an FPGA, considering that 100% of the resources are available and taking as reference the resources used by one operator. The results obtained following this definition are depicted in Figure 3.11 for the HP operators (as they are the ones requiring fewer resources) and for four FPGAs which present a good mix of elements, two Virtex-4 (SX55, FX140) and two Virtex-5 (FX200, SX240), see Table 3.9. The plot shows the great capacity of these FPGAs to implement complex single precision floating-point algorithms, even those involving hundreds of operators. However, other facts have to be taken into account in a real scenario, such as:

• Routing stress: for large implementations, routing congestion can seriously affect the final performance or make it impossible to route already mapped operators.

Figure 3.12: Synthetic Datapath.

• Use of logic resources instead of embedded elements: when no more embedded elements are available, they can be substituted by slices.

• The datapath: replicated logic could be removed by implementation tools for operators sharing the same inputs. Additionally, the input or output registers should be removed for chained operators.

Taking all these features into account, we have designed a synthetic datapath, Figure 3.12, to analyze how many times an operator can be replicated. The datapath is composed of n levels of 10 operators each, while 9 additional operators provide the final output. For coarse grain configurability, n can be increased adding more levels, while for fine grain configurability, operators can be added at the output. The datapath has been designed to prevent the implementation tools from removing duplicated logic: there are no two equal inputs, using Zij (the output of each operator) and Z'ij (Zij with its bits reordered), while the operators are registered only at their outputs. We have used as reference FPGA a Virtex-4 XC4VFX140, while the design goal & strategy selected has been "balanced" in the Xilinx ISE environment. Among the operators not using embedded elements, we have chosen as operators under study the adder and the divider, as the adder can be considered a purely combinational operator while the divider keeps registered data, the divisor mantissa, through all the iteration steps. Among the ones using embedded elements, again two operators are studied in more detail: the multiplier, using only DSPs, and the logarithm, using both DSPs and Block RAMs. Figure 3.13 shows the results obtained for the adder (the slices results are normalized, the value 1 corresponding to 100% of the slices of the FPGA). As can be seen, the expected number of operators from Figure 3.11 (first vertical line from the left in Figure 3.13) is widely exceeded, it being possible to implement up to 241 adders. The first reason for this is that now only the outputs are registered. The second reason is that, when reaching the limits of the FPGA, the implementation tools focus their work on area optimization (although the results are obtained targeting the balanced goal).

Figure 3.13: Adder Replicability Results.

Figure 3.14: Divider Replicability Results.

Therefore, two more theoretical limits have been obtained: the first one with a balanced implementation (203, second vertical line) and the second with an area-oriented implementation (250, third vertical line). It can be seen that 100% of the resources of the FPGA are used before reaching the 241 operators. The number of slices used grows linearly with the number of operators until 100% is reached. Then, the implementation tools reduce the slices needed per operator. At this point the speed of the datapath (MHz curve of the graph) is seriously affected, as routing becomes more difficult for each new operator, harming the performance. The results for the divider are shown in Figure 3.14. In this case, the area-oriented and balanced limits are the same, as the limiting resource of the divider is the slice flip-flops. The experimental result obtained is even better than the limit and, unlike the adder, the speed of the datapath is not affected when reaching the limit. The situation changes when the embedded elements are the limiting components, as can be seen in Figures 3.15 and 3.16.

Figure 3.15: Multiplier Replicability Results.

Figure 3.16: Logarithm Replicability Results.

In both cases, the DSPs are the limiting resource, and it can be observed that when 100% of the DSPs are reached, the implementation tools replace some 18x18 multipliers with logic, and they do it for all the operators. If we focus on the multiplier unit, we can see that for 48 multipliers the 192 DSPs are used, each multiplier being configured with its optimum number of DSPs (4). But when a new multiplier is introduced, the implementation tools replace one DSP per multiplier with logic⁶ and, consequently, there is a big increase in the number of slices used while a big decrease is observed in the number of 18x18 multipliers. The same happens when reaching the next limits, 64 and 96 multipliers. Finally, the slices become the limiting resource, not allowing the next limit to be reached. The same behavior can be observed for the logarithm operator. However, now the limiting resources are both the DSPs and the slices, so when the third limit is reached, 64 logarithms, 100% of both are used, making it impossible to replace one DSP per logarithm.

⁶ For 49 multipliers, strictly only 45 multipliers with four 18x18 multipliers and four with three would be necessary.

3.6. Towards Standard Compliance and Performance

As just seen in Section 3.5, improved performance can be obtained using sub-standard operators. However, how can we obtain a standard compliant library while trying to preserve the improvements of the sub-standard libraries? Since using dedicated flags for the type of number has no cost in terms of standard compliance, we can always apply them by using interfaces. The other two decisions cannot be directly introduced if compliance with the standard is a must, so trying to fulfill the standard will require additional modifications.

3.6.1. Simplification of Denormalized Numbers: One Bit Exponent Extension

Whether a number is denormalized for a given floating-point precision depends on the exponent bit-width of that precision, wE. It will be denormalized if the value of its leading one corresponds to a 2^x with x in [−2^(wE−1) + 2 − wF, −2^(wE−1) + 1], while it will be normalized if x is in [−2^(wE−1) + 2, 2^(wE−1)). Consequently, a denormalized number for a given precision becomes a normalized number if that precision is extended by one bit. Therefore, the handling of denormalized numbers as zeros in the operators can be applied in combination with an extension of the number precision by one exponent bit. Now the interfaces between standard numbers and hardware numbers will be in charge of transforming the numbers, while the operators will have to handle numbers with an extra bit in the exponent. However, handling exponents has a small hardware cost in the operators, much smaller than handling denormalized numbers, see Section 3.6.3. Regarding standard compliance, this solution ensures that we are not losing accuracy or resolution, as the denormalized numbers of the new format will always be zero in the standard precision. Furthermore, more accurate results can be obtained with the extended precision. In datapaths involving several operations we can find that partial results that were zeros or infinities with the standard precision are normalized numbers with the new precision, so subsequent operations are more accurate. Furthermore, final results that were a zero, an infinity or a NaN can now have a numeric value with the new precision.
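The effect of the one-bit exponent extension can be checked numerically. The small C program below takes the smallest single-precision denormal, 2^-149, and verifies that its exponent falls outside the normalized range of wE = 8 but inside the normalized range of wE = 9; the bounds used are the standard minimum normalized exponents for those exponent widths.

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        int e;
        double x = ldexp(1.0, -149);     /* smallest single-precision denormal */
        frexp(x, &e);                    /* x = m * 2^e with m in [0.5, 1)     */
        int lead = e - 1;                /* exponent with the significand in [1, 2) */
        printf("leading-one exponent: %d\n", lead);
        printf("normalized with wE = 8 (min -126)? %s\n", lead >= -126 ? "yes" : "no");
        printf("normalized with wE = 9 (min -254)? %s\n", lead >= -254 ? "yes" : "no");
        return 0;
    }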

3.6.2. Truncation Rounding: Mantissa Extension

A one-bit mantissa extension can be applied to try to obtain the same accuracy with truncation rounding as with round to nearest. The ulp of the new precision with the extended mantissa, ulp', has a value of half the standard ulp. So the maximum error with extended precision and truncation rounding, 1 ulp', equals the maximum error obtained with standard precision and round to nearest, 0.5 ulp.

Table 3.10: Operators results with the final proposed features.

                        +/-       *       /       √      e^x      ln
Slices                  378     132     742     366     470     760
Slices (+1)             385     140     742     368     505     805
Speed [MHz]           270.0   254.6   286.2   250.1   246.7   258.5
Speed [MHz] (+1)      273.9   250.6   284.4   250.6   248.3   258.5
Stages                    8       6      26      17      16      16

This solution requires that each operator compute the input mantissas with one extra bit and also generate outputs with the extra bit. Part of the advantages of truncation rounding, not having to compute the round and sticky bits, is lost, while the calculation unit has to work with input mantissas of one more bit, requiring more resources. Concerning the bias introduced by truncation rounding, this is a source of non-compliance with the standard that cannot be corrected. The generated bias can be reduced by extending the precision with more mantissa bits, making the relative error per operation smaller and thus also the bias. Nevertheless, the bias can have an important impact on the accuracy of the results, for example, in the statistical analysis of large quantities of data for applications requiring a high degree of accuracy. Therefore, as the bias cannot be corrected, round to nearest must be kept for standard compliance and to avoid its impact on certain applications.

3.6.3. FPGA-oriented floating-point library

From the previous analysis and discussions, we consider the following features as good choices for almost standard operators while still taking advantage of the FPGA flexibility:

• Use of dedicated flags for the type of number.

• Handling denormalized numbers as zeros while the exponent is extended by one bit.

• Round to nearest should be kept as the rounding method.

Following these recommendations we have developed two hardware libraries: with the one-bit exponent extension, HW+1 (+1 indicates the exponent extension), or without the exponent extension, HW. Table 3.10 summarizes the results for the operators developed with those features, while Table 3.11 shows the results for the required interfaces. From the results in both tables we can see that the exponent extension has a major impact on the interfaces, as now they have to handle the conversion between denormalized and normalized numbers as well as the bias correction, whose value is related to the exponent bit-width (one more stage, more resources and a lower frequency). With respect to the operators, the complex ones are the most affected, as in both of them the input exponent is involved in several computations. Thereby, it would be better to use operators with no bit extension if, for a given application, denormalized numbers are not in the range of partial or final results.

Table 3.11: Required Interfaces.

                    SW-HW              HW-SW
                 HW      HW+1       HW      HW+1
Slices           42       124       35       120
Speed [MHz]   434.2     323.5    700.8     295.6
Stages            1         2        1         2

Figure 3.17: Standard Operators Evaluation. (a) Clock Frequency. (b) Pipeline Stages. (c) Logic Resources.

When we compare the results of the HW and HW+1 libraries with respect to the other libraries, we can see that, even though not all the improvements achieved with the HP operators can be preserved, the HW and HW+1 results are closer to the results of HP than to Std, see Figure 3.17. The HW and HW+1 operators present fewer pipeline stages than the Std operators, while the clock frequency is also closer to HP. Finally, considering the use of slices, the improvements with respect to the Std operators are between 12% (exponential) and 67.2% (multiplier) for the HW operators and between 10% and 66.1% for the HW+1 operators.

3.7. Conclusions

The integration density of current sub-micron technologies has allowed the implementation of complex floating-point applications in a single FPGA device. Nevertheless, the complexity of these operators (deep pipelines, computationally intensive algorithms and format overhead) makes their design especially long and complicated. The availability of complete FPGA-oriented libraries significantly simplifies the design of complex floating-point applications. In particular, we have chosen the following operators: adder/subtracter, multiplier, divider, square root, logarithm and exponential functions. In this chapter we have presented and studied the following design decisions that can be made to improve the performance of floating-point operators implemented in FPGA architectures:

• Simplification of denormalized numbers.

• Limitation of rounding to just truncation towards 0.

• Introduction of specific hardware representations for the different types of numbers.

An in-depth analysis of the performance-accuracy trade-offs of those decisions has been carried out through the development and comparison of a complete set of floating-point libraries. The following libraries have been studied and developed:

• Std: operators without any significant change with respect to the floating-point standard (supporting the four rounding methods and denormalized numbers). Exception handling and NaN signaling are not provided.

• Norm: operators handling denormalized numbers as zeros.

• 0_Rnd: operators with only the truncation rounding mode and also handling denormalized numbers as zeros.

• HP (High Performance): operators designed including all three design decisions: denormalized numbers as zeros, truncation rounding and use of flags for the type of number.

In this analysis we have not focused on the underlying computation algorithms, but on the overhead due to some of the floating-point standard features. The format overhead implies a significant use of resources; reducing it is therefore essential to obtain operators that are better suited for FPGAs while still offering good performance. In particular, it has been shown that the handling of denormalized numbers has a major impact on the FPGA operators. Following the results obtained in the previous study, we have discussed and selected a set of features that provides improved performance and reduced resource usage. This set has been chosen to design two additional FPGA-oriented hardware libraries that preserve (or even improve) the accuracy and resolution given by the standard. The operators of these libraries are the base components for the implementation of the target application.

Finally, a second analysis has been carried out to study the capabilities of FPGAs to implement complex datapaths. It has been shown that the huge capacity of current FPGAs allows up to hundreds of single-precision floating-point operators. Despite this capacity, this second analysis has also demonstrated how the working frequency of the operators is severely affected by the routing of their elements when the operators are not isolated and a high percentage of the resources of the FPGA are used. The work done in this chapter has been published in a journal paper [ELV11b]:

• Pedro Echeverría and Marisa López-Vallejo. Customizing floating-point units for FPGAs: Area-performance-standard trade-offs. Microprocessors and Microsystems - Embedded Hardware Design, 35(6):535–546, 2011.

and in a conference paper [SEMLV08]:

• Miguel Angel Sánchez, Pedro Echeverría, Francisco Mansilla, and Marisa López-Vallejo. Designing highly parameterized hardware using XHDL. In Forum on Specification and Design Languages, pages 78–83, 2008.

4

Exponentiation Operator

In the previous chapter, we focused on the design of the most common mathematical operators with the best features for obtaining accelerators with the best performance-resource combination. The chosen operators have been studied in depth in the literature, and therefore the focus was not on the algorithms implemented inside the operators but on analyzing how to lighten the arithmetic overhead. However, there are other operators that have not been studied so deeply, basically due to their enormous complexity. This is the case of the exponentiation function x^y, a computationally intensive function widely used in scientific computing, 3D computer graphics or signal processing. In this chapter, this operator is thoroughly studied and we analyze how it can be implemented. The main objective is the implementation of an accurate exponentiation operator taking advantage of FPGA features and flexibility. An architecture based on the use of sub-operators, an error analysis of the partial results and the use of tailored precision is proposed. Finally, a variable precision exponentiation function is developed in collaboration with the FloPoCo project [dDP] led by Professor Florent de Dinechin.

4.1. Exponentiation function

The complexity of the exponentiation function x^y (where x is the base and y the exponent) makes it very difficult to implement an efficient and accurate operator in a direct way without any range reduction. However, it can be reduced to a combination of other operations and calculated straightforwardly with the transformation:

z = x^y = e^{y × ln x}     (4.1)

as can be seen in Figure 4.1. The revision of the IEEE-754 standard that defines floating-point implementations [IEE08] chose to offer two independent and consistent functions for the exponentiation:

• pown(x,n) is defined only for integer n. This is the function that should be used in polynomial evaluators, for instance. The standard also includes rootn(x,n), defined as x^{1/n}.
• powr(x,y) is defined by equation (4.1), and in particular is undefined for negative x.

while a third function merges both:

• pow(x,y) is defined for x and y real numbers, and special handling is provided for negative x in combination with integer y.

Table 4.1 summarizes how each of the three functions handles the special cases that can lead to exceptions or particular values. We will focus on powr(x,y) and pow(x,y), as pown(x,n) is a special case of pow(x,y). As shown in Figure 4.1, the straightforward translation of equation (4.1) requires three sub-operators (a logarithm, a multiplier and an exponential). To develop our exponentiation operator we will follow this approach (a short software sketch of this decomposition follows the list below), which presents three main problems that have to be effectively handled:

• The complexity of both the exponential and the logarithm functions. However, as seen in the previous chapter, the use of table-driven methods in combination with range reduction algorithms [Tan89, Tan90] makes their implementation possible.
• The computation with a negative base results in Not a Number, even though the exponentiation function is defined for negative bases and integer exponents as implemented in the standard for the pow(x,y) function.

Figure 4.1: Simplified overview of a power function unit.

Table 4.1: Exception handling for the exponentiation function in the IEEE-754 standard.

            pow                                                   powr                           pown
            Case                                   Result         Case               Result      Case                          Result
x^±0        any x                                  1              finite x > 0       1           any x                         1
0^y         y < 0 and odd integer                  ∞              finite y < 0       +∞          y < 0 and odd integer         ∞
            finite y < 0 and not odd integer       +∞                                            y < 0 and even integer        +∞
            finite y > 0 and odd integer           0              finite y > 0       +0          y > 0 and odd integer         0
            finite y > 0 and not odd integer       +0                                            y > 0 and even integer        +0
0^−∞        -                                      +∞             -                  +∞
(−0)^+∞     -                                      +0             -                  NaN
(+0)^+∞     -                                      +0             -                  +0
(−1)^±∞     -                                      1              -                  NaN
1^y         any y                                  1              finite y           1
                                                                  y = ∞              NaN
0^±0        -                                      1              -                  NaN
(+∞)^±0     -                                      1              -                  NaN
x^y         finite x < 0, finite non-integer y     NaN            finite x < 0       NaN

• Equation (4.1) can lead to a large relative error in the result. Even if the sub-operators were almost exact (up to half an ulp of maximum error), the relative error of each sub-operator spreads through the equation, generating a large final relative error [Mul06], see Section 4.2.2. Extending the precision of the partial results is an effective way to minimize these relative errors, see Section 4.2.3.
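As an illustration of this decomposition and of its special cases, the following software sketch mirrors the datapath of Figure 4.1 in double precision: the pow semantics is recovered from powr by detecting a negative base with an integer exponent and feeding |x| to the logarithm, in the same spirit as the exceptions unit of Section 4.3.4. It is a minimal reference model under the assumption of finite, non-zero inputs (the special values of Table 4.1 are left to the exception handling), not the hardware implementation.

import math

def powr_ref(x, y):
    # powr(x, y) = e^(y * ln x); undefined (NaN) for a negative base.
    # Assumes finite, non-zero x (zeros/infinities handled separately, see Table 4.1).
    if x < 0.0:
        return float('nan')
    return math.exp(y * math.log(x))

def pow_ref(x, y):
    # pow(x, y): additionally allows a negative base when y is an integer,
    # computing |x|^y and restoring the sign for odd y.
    if x < 0.0 and y == int(y):
        sign = -1.0 if int(y) % 2 else 1.0
        return sign * math.exp(y * math.log(-x))
    return powr_ref(x, y)

# The error spreading of Section 4.2.2 already shows up in software:
# pow_ref(1.5, 1000.0) and math.pow(1.5, 1000.0) typically differ in the last
# bits of the mantissa, since here y * ln(x) is about 405.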

4.1.1. Related Work

Only a few approximations to the x^y function can be found for FPGAs in the literature. A technique and a hardware architecture for the exponentiation function are presented in [PBM01]. This implementation only deals with a previously known constant exponent. This work is extended in [PEB04], allowing configurable exponents that can be integers or the inverse of an integer. It presents a detailed architecture based on equation (4.1), where the exponential and the logarithm are computed relying on high-radix iterative algorithms with redundant number representations. It is unclear whether this work was implemented, and the choices made (in particular the use of redundant arithmetic) are more suited to VLSI than to FPGAs, where both efficient additions (through fast-carry lines) and efficient multiplications are available.

4.2. Range and error analysis

For an elementary function, ensuring the round-to-nearest 0.5 ulp error bound requires evaluating the function with a very large intermediate precision, typically twice the result precision wF [Mul06], where wF is the mantissa bit width. A solution, sometimes used in software, is to compute with such precision only when needed. This solution does not allow a fixed-latency hardware implementation. Therefore (and in line with most software implementations), our goal is to allow a slightly larger error, with an error bound that remains smaller than 1 ulp, which is known as faithful rounding. In other words, the hardware operator shall return one of the two floating-point numbers surrounding the exact result. This error bound also ensures that if the exact result is a representable floating-point number, the operator will return it.

Table 4.2: Sub-operators Range Analysis.

             Input Range                                          Output Range
             function     single            double               function     single             double
e^x          (−∞, ∞)      (−87.3, 88.72)    (−708.4, 709.78)     (0, ∞)       (0, ∞)             (0, ∞)
ln x         (0, ∞)       (0, ∞)            (0, ∞)               (−∞, ∞)      (−87.33, 88.72)    (−708.4, 709.78)
×            (−∞, ∞)      (−∞, ∞)           (−∞, ∞)              (−∞, ∞)      (−∞, ∞)            (−∞, ∞)

4.2.1. Input-output range analysis

Before analyzing the exponentiation function implemented with equation (4.1), its sub-operators have to be studied. The operators' underflow and overflow situations will define the intervals of possible values for the intermediate variables. In this analysis we have not taken denormalized numbers into account, as we focus on the features selected in the previous chapter. In an implementation based on x^y = e^{y×ln x}, the result comes from the exponential function. First, we can focus on its output. Defining α and ω as the smallest and largest positive floating-point numbers respectively, if the exact value of x^y is smaller than α or greater than ω, we have to return +0 or +∞ respectively. This defines the useful range of the product p = y × ln x, which is the input to the exponential function:

m_p = log(α), rounded up

M_p = log(ω), rounded down

For practical floating-point formats, this range is very small, as can be seen in Table 4.2 for single and double precision. Analytically, we have ω < 2^{2^{wE−1}}, therefore M_p < 2^{wE−1}, where wE is the exponent bit width of the given precision. Additionally, for small values of p close to zero we have e^p ≈ 1 + p + p²/2, so the exponential will return 1 for all inputs p smaller than 2^{−wF−2}, where wF is the mantissa bit-width. Given the exponential property:

e^{a+b} = e^a × e^b     (4.2)

the bits of the input number with a weight smaller than 2^{−wF−2} will have no impact on the result. Combining these limits with the reduced input range, the exponential function in floating-point format only has resolution for a reduced number of bits. In the case of the most common precisions, this resolution comprises bits with weights between 2^{−25} and 2^{6} for single precision, and between 2^{−54} and 2^{9} for double precision. For the logarithm function, as the inverse of the exponential, the analysis is exactly the same except that inputs and outputs are exchanged. Looking at the operation ranges from Table 4.2 and equation (4.1), it is clear that the straightforward architecture has to be completed with some extra component to adapt the (0, ∞) domain of the logarithm to the range of the exponentiation function, which includes the full range, (−∞, ∞), for both the base and the exponent. For the case of a negative base and an integer exponent, the result is a real number for the pow(x,y) and pown(x,n) functions. However, the straightforward implementation would lead to a Not a Number result, making it necessary to handle this case as a special case.
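These ranges are easy to reproduce numerically. The short sketch below (plain Python, independent of any hardware tool) computes m_p and M_p for single and double precision from α and ω, matching the useful exponential input ranges listed in Table 4.2; it assumes denormalized numbers are not generated, as in the previous chapter.

import math

def useful_exp_range(wE, wF):
    # Smallest positive normal number and largest finite number of the format
    bias  = 2**(wE - 1) - 1
    alpha = 2.0**(1 - bias)                    # smallest normal number
    omega = (2.0 - 2.0**(-wF)) * 2.0**bias     # largest finite number
    # Useful input range of e^p: outside it the result is +0 or +infinity
    return math.log(alpha), math.log(omega)

print(useful_exp_range(8, 23))    # single: about (-87.34, 88.72)
print(useful_exp_range(11, 52))   # double: about (-708.40, 709.78)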

4.2.2. General Error analysis

Let us define our total error bound as ϵ̄_total = 1u, where u = 2^{−wF} is the relative value of the ulp; our goal is to ensure that ϵ_total < ϵ̄_total. The total relative error of the operator is defined by

ϵ_total = (R − x^y) / x^y = R / x^y − 1.     (4.3)

where R is the result obtained and x^y represents the exact result. We compute x^y using the formula x^y = e^{y×ln x}, implemented as three sub-operators: a logarithm, a multiplication and an exponential. Obviously, the errors introduced by each sub-operator spread through the equation, harming the accuracy of the exponentiation unit, and we have to control this issue. One possible solution is to use tailored sub-operators whose inputs and/or outputs have extended precision, generating more accurate partial results with more bits of precision so that their relative error is reduced and, hence, so is the global relative error. We will denote by mul, E and Ln the implementations of the multiplication, exponential and logarithm used here. They entail the following errors:

ϵ_mul(a, b) = mul(a, b) / (ab) − 1 < ϵ̄_mul,     (4.4)

ϵ_exp(x) = E(x) / e^x − 1 < ϵ̄_exp,     (4.5)
ϵ_log(x) = Ln(x) / ln x − 1 < ϵ̄_log.     (4.6)

The purpose of this error analysis is to define the relationship between ϵ_total, ϵ_log, ϵ_exp and ϵ_mul, and to deduce from it the architectural parameters that enable faithful rounding at a minimum cost. Rewriting (4.3), we obtain:

ϵ_total = E(mul(y, Ln(x))) / e^{y×ln x} − 1

For the first sub-operator, the logarithm, we have, following equation (4.6):

Ln(x) = ln(x)(1 + ϵ_log)

while for the multiplier, following equation (4.4):

mul(y, Ln(x)) = y Ln(x)(1 + ϵ_mul)
              = y ln(x)(1 + ϵ_log)(1 + ϵ_mul)
              = y ln(x)(1 + ϵ_log + ϵ_mul + ϵ_log ϵ_mul)
              = y ln(x) + y ln(x)(ϵ_log + ϵ_mul + ϵ_log ϵ_mul)

and finally for the exponential:

R = E(mul(y, Ln(x)))
  = e^{mul(y, Ln(x))} (1 + ϵ_exp)
  = e^{y ln(x) + y ln(x)(ϵ_log + ϵ_mul + ϵ_log ϵ_mul)} (1 + ϵ_exp)
  = e^{y ln(x)} · e^{y ln(x)(ϵ_log + ϵ_mul + ϵ_log ϵ_mul)} (1 + ϵ_exp)

which leads to the final relative error for the computation of x^y with equation (4.1):

ϵ_total = e^{y ln(x)(ϵ_log + ϵ_mul + ϵ_log ϵ_mul)} (1 + ϵ_exp) − 1     (4.7)

Following equation (4.7), Table 4.3 shows ϵ_total calculated for single and double precision exponentiation units using sub-operators of the same precision (all ϵ are given in ulps, the maximum relative value of the ulp¹ being listed in the "ulp value" column). As the exponential and logarithm operators that can be found in the literature are not exact implementations, two scenarios have been considered depending on the maximum relative error of their faithful rounding:

¹ When all the bits of the mantissa are zeros.

Table 4.3: Powering function Relative Error (ulp).

            ulp value            ϵ_log    ϵ_mul    ϵ_exp    ϵ_total
Single      2^−23     Best       0.5      0.5      0.5      89.22
                      Worst      1        0.5      1        134.08
Double      2^−52     Best       0.5      0.5      0.5      710.05
                      Worst      1        0.5      1        1065.5

• Best case: with maximum ϵ_log and ϵ_exp equal to 0.5 ulp.

• Worst case: with maximum ϵ_log and ϵ_exp equal to 1 ulp.

In both cases, the product of the multiplier can be computed exactly and rounded to nearest.

Given the M_p values (the exponential input range in Table 4.2), we can obtain the magnitude of ϵ_total introduced by an exponentiation implementation whose sub-operators' precision is not extended.

For single precision, ϵ_total lies between 89.22 and 134.08 ulps, affecting up to the eight least significant bits of the significand; in terms of maximum error, this implies that the result can be off by up to 1.06 × 10^−3 %.

For double precision, ϵ_total, in terms of ulps, increases due to the higher M_p, reaching 1065.5 ulps in the worst case and affecting up to the 11 least significant bits of the significand of the result. However, due to the higher precision the impact of each ulp is much smaller, and the maximum error is now reduced to 2.36 × 10^−11 % of the result.
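Since all the quantities in equation (4.7) are known, the single precision rows of Table 4.3 can be checked with a few lines of Python. The sketch below simply propagates the sub-operator error bounds through equation (4.7); it is a numerical sanity check of the analysis, not part of the hardware design.

import math

u  = 2.0**-23                                  # relative value of the ulp, single precision
Mp = math.log((2.0 - 2.0**-23) * 2.0**127)     # ~88.72, see Table 4.2

def eps_total(eps_log, eps_mul, eps_exp):
    # Equation (4.7) evaluated with |y ln x| at its maximum M_p
    return math.exp(Mp * (eps_log + eps_mul + eps_log * eps_mul)) * (1.0 + eps_exp) - 1.0

print(eps_total(0.5 * u, 0.5 * u, 0.5 * u) / u)   # best case:  ~89.22 ulps
print(eps_total(1.0 * u, 0.5 * u, 1.0 * u) / u)   # worst case: ~134.08 ulps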

4.2.3. Error Analysis for accurate x^y

In equation (4.7), all the terms in the exponent must be small compared to 1, otherwise the operator could not be accurate. Therefore, using the Taylor series e^z ≈ 1 + z + z²/2, this equation becomes

ϵ_total = y ln(x) ϵ_log + y ln(x) ϵ_mul + ϵ_exp + ϵ_order2     (4.8)

where ϵ_order2 is a term that gathers all the terms of order 2 or higher in the development of (4.7). Each of these terms is of the order of u², and for practical values of wF (i.e. wF > 8) we have

ϵ_order2 < ϵ̄_order2 = u/16 (a tighter bound is not needed).

Replacing all the other errors by their upper bounds in (4.8), and using the bound M_p on y ln(x) defined in Section 4.2.1, we obtain

ϵ_total = M_p ϵ_log + M_p ϵ_mul + ϵ_exp + ϵ_order2     (4.9)

thus the bound of 1u on ϵ_total leads to the constraint

M_p ϵ_log + M_p ϵ_mul + ϵ_exp < 1u − ϵ_order2 = (1 − 1/16) u     (4.10)

A rule of thumb for an efficient implementation is to try and balance the contributions of the three sub-operators to the total error, i.e. balance the impact of ϵ_mul, ϵ_exp and ϵ_log in the previous equation. Indeed, if one sub-operator contributes a much smaller error than another, we should try to degrade its accuracy in order to save resources. This means here that we should aim at

M_p ϵ_log ≈ M_p ϵ_mul ≈ ϵ_exp.     (4.11)

Each of these terms is analyzed in the following.

4.2.3.1. Exponential error analysis

We start with the exponential, which produces the output. Typically, a hardware implementation of the exponential uses internal guard bits to control its output accuracy, which is expressed as

ϵ_exp = (1/2 + t·2^{−g_exp}) u     (4.12)

where the 1/2 term is due to the final rounding, g_exp is the number of guard bits that controls the accuracy of the internal datapath, and t is a factor determined by the error analysis of the chosen implementation. For example, t = 18 in [DdD07] and t = 7 in the FloPoCo implementation [dDP10] that will be used. Equation (4.10) now becomes

M_p ϵ_log + M_p ϵ_mul + t·2^{−g_exp} u < (1 − 1/2 − 1/16) u     (4.13)

It should be noted that in publications related to the exponential, such as [DdD07] and [dDP10], the error analysis assumes an exact input to the exponential. In the implementation of x^y, however, the input is mul(y, Ln(x)), which is not exact. Its error has been taken into account in the previous analysis. However, the exponential begins with a shift left of the mantissa by up to wE bit positions, which could scale up this error by up to 2^{wE}. To avoid that, we have to modify the architecture of the exponential slightly so that the least significant bit after this shift has weight 2^{−wF−g_exp}, ensuring that the rest of the error analysis of the exponential remains valid. Instead of using a wF-bit mantissa input for the exponential, a (wF + wE + g_exp)-bit input has to be used for the product mul(y, Ln(x)).

4.2.3.2. Logarithm error analysis

To satisfy equation (4.11), we have to implement a logarithm with a relative error ϵ_log ≈ 2^{−g_exp} u / M_p.

The simplest way is to use a faithful logarithm for a (wF + wE − 1 + g_exp)-bit mantissa. Its error will be bounded by 2^{−wE+1−g_exp} u, and as M_p < 2^{wE−1} we have M_p ϵ_log < 2^{−g_exp} u. Architecturally, the mantissa of the input x is simply padded right with zeroes, then fed to this logarithm unit.

4.2.3.3. Multiplier error analysis

We first remark that this multiplier has asymmetrical input widths: y is an input with a wF-bit mantissa, while Ln(x) is slightly larger; we just saw that its mantissa will have wF + wE − 1 + g_exp bits.

Multiplying these two inputs leads to a (2wF + wE − 1 + g_exp)-bit mantissa, which may be computed exactly. Three options are possible:

• Round this product to wF + wE − 1 + g_exp bits, which (as for the logarithm) entails a relative error such that M_p ϵ_mul < 2^{−g_exp−1} u.
• Truncate the multiplier result instead of rounding it, which saves a cycle but doubles the error.
• Use a truncated multiplier [WSM01, BdDPT10] faithful to 2^{−wE−g_exp} u.

This last option will entail the lowest resource consumption, especially in terms of embedded multipliers. In addition, the output mantissa size of this multiplier perfectly matches the extended input to the exponential unit discussed above. This is the choice made for the current implementation.

4.2.3.4. Error Analysis Summing up

With the implementation choices detailed above, we have only one parameter left, g_exp, and equation (4.13) now becomes

2^{−g_exp} u + 2^{−g_exp} u + t·2^{−g_exp} u < (1 − 1/2 − 1/16) u     (4.14)

which defines the constraint on g_exp:

g_exp > −log_2 ( 0.4375 / (t + 2) )     (4.15)

With the chosen implementation of the exponential [dDP10] (t = 7), we deduce that g_exp = 4 ensures faithful rounding. A larger g_exp may also be used to obtain a higher percentage of correctly rounded results.

4.3. Variable precision implementation with FloPoCo

This analysis has been integrated in the FloPoCo tool [dDKP09]. FloPoCo is an open-source core generator framework that is well suited for the construction of complex pipelines such as the exponentiation unit described here. Its salient feature, compared to traditional design flows based on VHDL and vendor cores, is to enable the modular design of flexible pipelines targeting a user-specified frequency on a user-specified target FPGA. The presented error analysis is relatively independent of the implementation of the underlying exponential, logarithm and multiplier cores. However, the selected implementations have been the ones already available in FloPoCo. This is due to two reasons. On the one hand, the ease of integration, as they are already available in FloPoCo and no extra work is required. On the other hand, and much more importantly, the features of the exponential and the logarithm provided by FloPoCo. Their combination of features matches those required for the implementation of our variable precision exponentiation [dDP10, dD10]:

• They scale up to double precision, contrary to the exponential and logarithm analyzed in the previous chapter [DdD07].

• They are pipelined, as opposed to [DdDP07b].

• They are open, as opposed to the Altera cores [Alt08a, Alt08b], which are the only other option fulfilling the previous features.

The selected logarithm and exponential functions are currently under research to improve some of their features, such as their pipelined datapath or the reduction of resources in parts of their algorithms. Hence, the results obtained in Section 4.4 are expected to improve as these sub-components themselves are improved. With respect to the multiplier, the ones provided by FloPoCo have been used as well. Figure 4.2 depicts the architecture of the exponentiation function unit x^y. As can be seen, an exceptions unit completes the logic, while only positive bases are fed to the logarithm. This combination of features avoids obtaining a Not a Number result at the logarithm unit for the case of a negative base and an integer exponent. As seen in the figure, each sub-operator has a different precision, as exposed in the error analysis. Even each input of the multiplier has its own precision, allowing the relative error of each operation to be controlled. Finally, the result obtained from the exponential function is rounded and formatted following the exception handling determined by the exceptions unit.

Figure 4.2: Power function architecture.

4.3.1. Logarithm

This is the most time- and resource-consuming part of the operator. The problem is that in the architecture used [DdDP07b, dD10] each iteration involves reading a table with a k-bit input and performing a multiplication of a large number by a k-bit number. Values of k close to 10 should be used to match the table size with the embedded RAM size, but then the embedded multipliers (18×18 bits, or larger) are not fully exploited. A value of k = 18 is optimal for multiplier usage, but leads to prohibitively large tables. In the present FloPoCo architecture, a k between 10 and 12 is used as a trade-off, but it can be concluded that this iterative range reduction, designed before the arrival of embedded multipliers and RAM blocks, is not well suited to them. Future FloPoCo releases may include the replacement of these iterations by a polynomial approximation designed for embedded memories and multipliers [dDJP10] to improve the logarithm.

4.3.2. Multiplier

As mentioned in Section 4.2.2, we need a rectangular multiplier, and we have the option of using a truncated version [WSM01, BdDPT10] to save DSP blocks. The saving is only noticeable for large precisions, so at present standard round-to-nearest multipliers are still used.

4.3.3. Exponential

The exponential currently used [dDP10] is based on a table-based reduction [Tan89] followed by a polynomial approximation [dDJP10]. The single precision version consumes only one 18×18 multiplier and an 18 Kbit RAM, which matches very well current FPGAs from both Xilinx and Altera. For larger precisions, the resource consumption remains moderate, as shown in Table 4.5. Compared to a standard exponential unit, there have been two modifications. At the input, the precision is extended from the standard wF to wF + wE + g_exp, as detailed in Section 4.2.3.1. At the output, information from the exceptions unit is taken into account.

Table 4.4: Synthesis results for Virtex-4 (4vfx100ff1152-12) for the pow function.

              precision     performance           resources
              (wE, wF)      cycles     MHz        slices     BRAMs     DSP48
[ELV08]       (8, 23)       34         210        1,508      3         13
              (8, 23)       19         99         1,149      11        13
              (8, 23)       32         200        1,249      11        13
              (8, 23)       57         262        1,726      11        13
              (11, 52)      42         101        4,337      28        54
              (11, 52)      70         157        4,380      30        54
              (11, 52)      117        262        6,096      30        54

4.3.4. Exceptions Unit

This unit is in charge of analyzing the types and values of the input data to detect the exception cases summarized in Table 4.1. As seen in Section 4.1, the standard [IEE08] defines three possible implementations of the power function, and we implement the two most general ones, pow and powr. In addition to the exceptions related to infinity, zeros, NaN or x = 1, this unit is in charge of detecting whether y is an integer, as in that case a negative x is allowed for the pow function. This comes down to determining the binary weight of the least significant '1' of the mantissa of y. Let us define e:

e = E_y − E_0 − wF + z

where E_y is y's exponent, E_0 the bias and z is the count of '0' bits to the right of the rightmost '1' bit in y's mantissa. If e is negative, y is fractional. Otherwise, y is an integer, and we have to extract its parity bit to determine whether y is even or odd.
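The same test is easy to express in software over the IEEE-754 encoding. The sketch below (Python, double precision, using the standard struct module) computes e and the parity exactly as described above; signs, zeros, infinities and NaNs are assumed to be filtered beforehand by the rest of the exceptions unit.

import struct

def integer_and_parity(y):
    # Unpack the IEEE-754 double precision encoding of |y|
    bits = struct.unpack('<Q', struct.pack('<d', abs(y)))[0]
    Ey = (bits >> 52) & 0x7FF            # biased exponent
    m  = bits & ((1 << 52) - 1)          # stored mantissa (wF = 52 bits)
    E0, wF = 1023, 52
    # z: '0' bits to the right of the rightmost '1' of the significand
    sig = m | (1 << wF)                  # add the implicit leading '1'
    z = (sig & -sig).bit_length() - 1
    e = Ey - E0 - wF + z                 # weight of the least significant '1'
    if e < 0:
        return False, None               # y is fractional
    # y is an integer; it is odd only if its least significant '1' has weight 2^0
    return True, (e == 0)

print(integer_and_parity(6.0))    # (True, False): even integer
print(integer_and_parity(7.0))    # (True, True):  odd integer
print(integer_and_parity(2.5))    # (False, None): fractional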

4.4. Experimental Results

Tables 4.4 and 4.5 summarize the experimental results obtained for the two most commonly used floating-point precisions, single and double. The results of the generated architectures for three target frequencies (120, 200, 350 MHz) are shown for pow(x,y) in Table 4.4. Meanwhile, in Table 4.5 the results for the sub-components are summarized for the target frequency of 200 MHz.

Table 4.5: Separate synthesis results for the sub-components (targeting 200 MHz).

Operator (wE, wF)             latency    freq [MHz]    slices    BRAM    DSP
FPPowr(8, 23)                 32         200           1,249     11      13
FPLog(8, 33)                  16         196           798       10      9
FPMult(8, 23×33→34)           4          294           109       0       4
FPExp(8, 23)                  9          243           390       1       1
FPPowr(11, 52)                70         157           4,380     30      54
FPLog(11, 66)                 34         157           2,491     25      24
FPMult(11, 52×66→67)          7          263           496       0       16
FPExp(11, 52)                 26         230           1,253     5       14

Table 4.6: Separate synthesis results for the exceptions unit.

             (8, 23)             (11, 52)
             pow      powr       pow      powr
Slices       23       9          111      13

4.4.1. Results Analysis

Several facts can be remarked. First, the extra precision needed by the operators: for the logarithm, 10 extra bits for single precision and 14 for double; for the exponential, in addition to the four extra guard bits of its internal architecture, the input requires 11 (single) and 15 (double) extra bits. Second, the deep pipelines required due to the complexity of the logarithm and the exponential operators. At this point, it has to be taken into account that the FloPoCo project is under development and the pipeline is generated automatically with general rules. Hence, a careful hand-tuning of the datapath could reduce the number of stages while improving the frequency. In the same way, the automatic pipelining of the subcomponents is responsible for not achieving the desired frequency (see the 350 MHz implementations).

4.4.2. Comparison with previous work

Table 4.4 also includes the results of a previous approach we carried out [ELV08] (configured with the same features as the FloPoCo operators). In that approach, we developed a first exponentiation operator, only for single precision, based on the exponential and logarithm units from Chapter 3. It is also based on extending the precision of the partial results, but it is restricted to the guard bits of those single precision operators and therefore it does not ensure an error below one u. This implementation can be directly compared with the single precision implementation targeting 200 MHz due to their similar working frequencies. The results of both units are very similar, clearly differing only in the use of BRAMs (due to the different algorithms used for the logarithm and the

Table 4.7: Exception’s control results. 8 23 11 52 pow powr pow powr Slices 23 9 111 13 exponential functions), although [ELV08] was hand-tuned looking for the optimal implementations for single precision of each subcomponent and for an optimal and balanced pipeline. Furthermore, in the comparison [ELV08] it is also benefited from the fact that it uses a logarithm with less input/output precision, (8,23), instead of the values used now (8,33). Therefore, this variable precision architecture, in addition to being accurate, it can be generated with FloPoCo for any precision and it is also more efficient and with better performance than [ELV08].

4.4.3. Exceptions Unit

Finally, Table 4.7 summarizes the cost of the exceptions control for pow(x,y) and powr(x,y), which is the only difference between both operators. The largest resource cost for pow(x,y) comes from determining whether y is an integer. This operation implies determining the weight of the least significant '1' of y's mantissa: searching for that '1' and obtaining its weight given y's exponent. Therefore, its resource usage is highly dependent on the mantissa bit-width, mainly due to the search for the least significant one, as can be seen in Table 4.7, where the slices required are more than quadrupled for pow(x,y) in double precision.

4.5. Conclusions

The availability of advanced functions for FPGAs is essential for developing hardware co-processors to enhance the performance of computationally heavy applications such as scientific computing or financial and physics simulations. In this chapter, an accurate exponentiation operator for FPGAs has been presented. The developed exponentiation operator, based on the straightforward translation of x^y into a chain of sub-operators, relies on the FPGA flexibility which allows tailored precisions. Taking advantage of this flexibility, the provided error analysis focuses on determining which precisions are needed in the partial results and in the internal architectures of the sub-operators to obtain an accurate operator with a maximum error of one u. Finally, the integration of this error analysis and the development of the operator within the FloPoCo project has allowed us to automate the generation of exponentiation operators with variable precisions.

An initial work on the exponentiation operator was published in [ELV08]:

• Pedro Echeverría and Marisa López-Vallejo. An FPGA implementation of the powering func- tion with single precision floating point arithmetic. In 8th Conference on Real Numbers and Computers, pages 17–26, 2008.

A journal paper (P. Echeverría, F. de Dinechin, B. Pasca, and M. López-Vallejo. Flexible floating-point power function for reconfigurable architectures) is currently under development.

5

LIBOR Market Model Hardware Core

As presented in the introductory chapter, financial simulation is one of the fields where the use of Monte Carlo simulation is most widespread. In this PhD we have selected one of the best known financial models based on the Monte Carlo approach, the LIBOR¹ Market Model (LMM) [BGM97], as the validation application for hardware acceleration. This model is an ideal example for several reasons:

• The complex floating-point equations it requires, which imply a high computational load and the use of sophisticated operators.
• The need for high quality random variables.
• The use of variance reduction techniques to diminish the total simulation time.
• The high accuracy it demands.
• The complexity introduced by the different scenarios, which requires a complex control.
• Its complex data dependencies.

¹ London Interbank Offered Rate.

• The impact of the parameterization of the model on the core architecture.

• It can be considered as a complete model or be part of a bigger simulation.

All these features allow us to study different aspects of hardware acceleration and to analyze the different implications of using it. Furthermore, this model allows us to thoroughly check whether complex financial simulations can be accelerated with FPGAs and with what results. Previous works on Monte Carlo financial simulations, mentioned in Section 1.1.2.1, focus on models which either are not very complex or are simplifications of real models. In our case we focus on a much more complex model, without simplifications. With the LMM implementation, two objectives are pursued. The first, general objective is the study and analysis of hardware acceleration with FPGAs, using the LMM as a benchmark. The second main objective is to analyze whether complex financial simulations are suited for FPGA acceleration. Therefore, in this chapter we study the implications related to the hardware implementation of the model itself: how the model is adapted to hardware, which elements of the model have an impact on the final performance, the development of a performance-oriented architecture, etc. Another noteworthy objective is the study of which precision is enough to implement a complex application such as the LMM. We leave for the next chapter the implications related to the integration of a hardware accelerator within a software application.

5.1. LIBOR Market Model

The LMM, also known as the BGM Model (Brace Gatarek Musiela Model), is a financial model of interest rates for pricing interest rate derivatives. The LMM defines the time evolution of the curve of interest rates according to no-arbitrage settings. The main variables of the model, see Figure 5.1, are the LIBOR market forward rates, L_i(t, T_i, T_{i+1}) or LIBORs. These rates are reference rates, in the interest rate market, for the Forward Rate Agreements: contracts at time t to borrow money from T_i,

the fixing (also known as reset) date, to T_{i+1}, the maturity date, with the simple interest rate determined by

L_i. Furthermore, from the LIBORs it is possible to calculate all the other variables of the interest rate curve that may be needed. The LMM assumes that the evolution of each forward rate is lognormal and that each forward rate has a time-dependent volatility (a measure of the variation of the price of a financial instrument over time) and time-dependent correlations with the other forward rates (how the price of one instrument affects the price of another). After specifying these volatilities, σ_i(t), and correlations,

ρ_ij, the LIBORs evolve following the equation:

Figure 5.1: LIBOR Forward Rates.

dL_i(t, T_i, T_{i+1}) / L_i(t, T_i, T_{i+1}) = µ_i(t) dt + σ_i(t) dW_i(t)     (5.1)

with i = 0, ..., N − 1, where T_N is the maturity date of the last LIBOR, and dW_i(t) is a standard Wiener process (also known as Brownian motion): a time-continuous stochastic process (the random component of the model) related to the correlations between LIBORs, ρ_ij:

E[dW_i(t) · dW_j(t)] = ρ_ij dt     (5.2)

where the different ρ_ij compose the correlation matrix of the model, [C] = [ρ_ij], with ρ_ii = 1.

In equation (5.1), besides the LIBORs L_i(t) and the Brownian motion dW_i(t), we find the other two main components of the model:

• µ_i(t) is the drift. It represents the change in the average value of L_i(t) and it is determined by no-arbitrage conditions.

• σ_i(t) is the volatility of each LIBOR (its values are determined by a calibration of the model).

If small time steps are considered, see Figure 5.2, equation (5.1) can be solved with stochastic methods relying on the Monte Carlo approach, while the Brownian motion, now N_i(t_k), is obtained from Gaussian random variables:

L_i(t_{k+1}) = L_i(t_k) + µ_i(t_k) Δt_k + σ_i(t_k) N_i(t_k) √Δt_k,    i = 0, ..., N − 1     (5.3)

for S simulation times, k = 0, ..., S − 1, t_k being the simulation dates.

The σ_i(t_k) and ρ_ij values are determined by the calibration of the model, based on market data at present and past dates, and they are known before starting the simulation of the stochastic model. Meanwhile, µ_i(t_k) is state dependent and has to be calculated for each time step and each LIBOR

Figure 5.2: Monte Carlo LIBORs simulation.

as its value depends on L_i(t_k). Using Ito's lemma [Ito51], µ_i(t_k) follows the equations:

µ_i(t_k) = −σ_i(t_k) Σ_{j=i+1}^{Q−1} [ τ_j L_j(t_k) / (1 + τ_j L_j(t_k)) ] σ_j(t_k) ρ_ij,        0 ≤ i < Q − 1
µ_i(t_k) = 0,                                                                                    i = Q − 1          (5.4)
µ_i(t_k) = σ_i(t_k) Σ_{j=Q}^{i} [ τ_j L_j(t_k) / (1 + τ_j L_j(t_k)) ] σ_j(t_k) ρ_ij,             Q − 1 < i ≤ N − 1

where τ_j is the accrual factor of L_j: the number of days during which L_j is alive divided by the number of days in a year.

Index Q ∈ [1, ..., N] identifies the maturity date, T_Q, of the zero coupon bond (a debt security which represents a financial value referred to a type of interest rate) used as numeraire (the reference value for the simulation); T_Q is the maturity date of one of the LIBORs, L_{Q−1}, and the last simulation date, t_{S−1}, see Figure 5.2.

The randomness of the model lies in the N_i(t_k) component. For each simulation step we need N random variables² with Gaussian marginal distribution N(0,1) and a joint multivariate distribution with correlation matrix [C]:

N_i(t_k) = Σ_{j=0}^{N−1} ã_{i,j} N(0,1)_j(t_k)     (5.5)

where ã_{i,j} are the coefficients resulting from the decomposition of matrix [C] into a pseudo square root matrix [Ã] via the Cholesky method:

[C] = [Ã][Ã]^T     (5.6)
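Equations (5.3) to (5.6) can be condensed into a very small software reference model, useful as a golden model against which hardware results are compared. The sketch below (Python with NumPy) advances all the LIBORs one time step for a single path, assuming for simplicity that every LIBOR is still active and that σ_i(t_k), τ_j, ρ_ij and Δt_k are already provided by the calibration; the function and variable names are illustrative, not those of the hardware implementation.

import numpy as np

def lmm_time_step(L, sigma_k, tau, C, dt, Q, rng):
    """One Euler step of equation (5.3) for all N LIBORs of a single path."""
    N = len(L)
    A = np.linalg.cholesky(C)                 # [C] = [A][A]^T, equation (5.6)
    gauss = rng.standard_normal(N)            # N(0,1) samples for this time step
    corr = A @ gauss                          # correlated noise N_i(t_k), equation (5.5)

    # Drift of equation (5.4): mu_i depends on the current LIBORs
    w = tau * L / (1.0 + tau * L)             # tau_j L_j / (1 + tau_j L_j)
    mu = np.zeros(N)
    for i in range(N):
        if i < Q - 1:
            mu[i] = -sigma_k[i] * np.sum(w[i+1:Q] * sigma_k[i+1:Q] * C[i, i+1:Q])
        elif i > Q - 1:
            mu[i] = sigma_k[i] * np.sum(w[Q:i+1] * sigma_k[Q:i+1] * C[i, Q:i+1])

    # LIBOR update, equation (5.3)
    return L + mu * dt + sigma_k * corr * np.sqrt(dt)

In the hardware engine the decomposition of the correlation matrix (and the factorized matrix [A] of Section 5.2.2) is computed once during the calibration stage, so only the matrix-vector product and the drift accumulation remain in the datapath.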

5.1.1. LIBOR Market Model as base to compute financial products

The LMM itself is a complete model that obtains a curve of interest rates through time, represented by the LIBORs. This curve is used to value, at the present date, different financial products whose underlying variables are the interest rates at future dates. The valuation of these products follows the Monte Carlo simulation of the LMM: for each replication of the LMM a different interest rate curve is obtained and the product is valued for that curve.

² One Gaussian variable per LIBOR.

The final value of the product is obtained from some statistical measure of the product valuations, such as the arithmetic mean. There is a great variety of products that are valued relying on the LMM, from very simple ones to others that are extremely complex in terms of the calculations they require. Furthermore, and unlike the LMM itself, which is a very stable model, the equations for those products can vary frequently. However, some common equations shared by a wide range of products can be found. The most representative one is the numeraire valuation at the main dates of the simulation, N(T_i), which is the first step to value some products:

N(T_i) = Π_{j=i}^{Q−1} 1 / (1 + τ_j L_j(T_i))     (5.7)

Then, the measures of the numeraire are used to value the pay-offs of the product, P_i, and from them the value of the product, known as the net present value, NPV. For example, for one of the simplest products, a cap [Cap]:

P_i = max{L_i(T_i) − K, 0} / N(T_i),    i = 1, ..., R     (5.8)

NPV = Σ_{i=1}^{R} (τ_i P_i)^α     (5.9)

where there are R payments at dates T_i, which coincide with the fixing dates of the simulated LIBORs, K is the strike (the cap value of the product) and α is one or a value close to one. As mentioned above, there is a wide variety of products. It is out of the scope of this Thesis to study the different product valuations in depth, although we are going to consider this example, the cap, to illustrate a more complete model with the addition of a post-processing stage to the LMM. In the next chapter we study in more detail how this post-processing affects the whole accelerator.
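A software sketch of this post-processing stage helps to see why it maps well onto a pipeline fed by the LIBOR engine: for each path it only consumes the LIBORs fixed at the payment dates, accumulates one term of equation (5.9) per payment, and involves the division and exponentiation operators of Chapters 3 and 4. The code below (Python with NumPy) is a plain reference formulation of the cap example; the array layout is an assumption for illustration, not the hardware datapath.

import numpy as np

def cap_npv(L_fix, tau, Q, K, alpha=1.0):
    """Net present value of a cap, equations (5.7)-(5.9), for one simulated path.

    Row i of L_fix holds the LIBOR curve observed at date T_i (row 0 is t = 0),
    so payments run over i = 1, ..., R with R = L_fix.shape[0] - 1."""
    R = L_fix.shape[0] - 1
    npv = 0.0
    for i in range(1, R + 1):
        # Numeraire at T_i, equation (5.7): product over j = i, ..., Q-1
        numeraire = np.prod(1.0 / (1.0 + tau[i:Q] * L_fix[i, i:Q]))
        # Pay-off at T_i, equation (5.8)
        P_i = max(L_fix[i, i] - K, 0.0) / numeraire
        # One term of the net present value, equation (5.9)
        npv += (tau[i] * P_i) ** alpha
    return npv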

5.2. Model Analysis

The equations detailed in Section 5.1 are the core of the LMM and the basis of the hardware implementation we have carried out, while the equations of Section 5.1.1 can be considered an optional module. However, a global perspective of the model will not be complete if, in addition to the equations themselves, other facts are not taken into account, such as:

• The range of the variables which determine the computational load of the simulation.

• Simplifications that can be introduced into the model.

• Complexity of the operators involved.

• Data dependencies.

5.2.1. Variables’ Range

Taking a first look at the equations of the model, it can be observed that, besides the complexity of the equations and the number of replications required, the computational load of the model is determined by the number of LIBORs to simulate, N, and the number of simulation times, S. For each replication of the model, from now on a path, there are S − 1 time steps to simulate³, with N LIBORs, N drifts and N correlated Gaussian variables. To get an idea of the number of variables that the model can handle per path, two examples with the largest values of the variables can be used. For the S variable, a 25 year simulation with weekly time steps can be considered, resulting in 1303 time steps. For N, we can consider monthly forward rates during 30 years, requiring 360 LIBORs, drifts and Gaussian variables per time step. Combining both leads to a model with 469,080 LIBORs per replication. However, one circumstance lightens the computational load of each replication: not all the LIBORs have to be computed at every time step. As LIBORs reach their fixing dates, their value is fixed and there is no need to compute those LIBORs again. Thereby, the number of LIBORs to be computed per time step, Nact (active LIBORs), decreases by one each time a fixing date is reached in the path. Again, a real example can help us understand this impact: for a simulation with 159 monthly LIBORs and 144 weekly time steps there are a total of 22,896 LIBORs to be computed. However, as there is approximately one fixing date every four simulation dates, the total number of LIBORs to be computed is reduced to 20,214, which corresponds to 11.9% fewer variables.

5.2.1.1. Use of Variance Reduction Techniques. Paths required

As seen in Section 2.3.2, the result of a Monte Carlo simulation converges to the exact result when the number of replications tends to infinity, so a huge number of paths is required to obtain the accuracy demanded in financial models. Due to the high computational load of each path, and to the huge number of paths required to obtain the desired accuracy, it is necessary to use variance reduction techniques to reduce the total amount of paths. In the LMM case we need random variables with Gaussian marginal distribution N(0,1) and a joint multivariate distribution with correlation matrix [C]. Thereby, Latin Hypercube is the technique best suited for the LMM. In principle, an N-dimensional Latin Hypercube would be needed, although factorization, see the next section, considerably reduces the number of dimensions required. Meanwhile, the number of stratas can be as large as wanted, with

³ Time 0 corresponds to the initial date, where the values of the LIBORs are known.

typical configuration values between 4 and 32 stratas. This fact has a direct impact on determining how to simulate, due to the implications of stratified variables on memory storage, see Section 5.3. Even taking into account the use of Latin Hypercube, the number of paths that need to be simulated can be in the order of one or two hundred thousand.

5.2.2. Simplifications to the model: Factorization

As just seen, the number of LIBORs, N, can be in the order of hundreds. Consequently, the dimension of matrixes [C] and [Ã], N × N, and the number of elements of the summations in equations (5.4), the drift, and (5.5), the Brownian motion, are also in the order of hundreds. However, a reduction factor F < N can be applied, evolving the forward rates with F independent Wiener processes [Jae02]. In this case, a new coefficient matrix is obtained, [A], with dimension N × F. Therefore, equation (5.5) can now be calculated with F Gaussian random variables according to the new factor F:

N_i(t_k) = Σ_{h=0}^{F−1} a_{i,h} N(0,1)_h(t_k)     (5.10)

Additionally, with the decomposition of matrix [C] into [Ã] and the subsequent factorization of [Ã] into [A], the drift equation can be rewritten as:

 ∑ ∑  F −1 Q−1 τj Lj (tk)  −σi(tk) aih σj(tk)ajh, 0 ≤ i < Q − 1  h=0 j=i+1 1+τj Lj (tk) µi(tk) = 0, i = Q − 1 (5.11)  ∑ ∑  F −1 i τj Lj (tk) σi(tk) aih σj(tk)ajh,Q − 1 < i ≤ N − 1 h=0 j=Q 1+τj Lj (tk)

5.2.2.1. Equations Analysis. Factorization Impact

There are three variables to compute per time step and per LIBOR (L_i(t_{k+1}), µ_i(t_k) and N_i(t_k)), and only the L_i(t_{k+1}) equation can be quickly computed, as it implies two additions and three multiplications. The other two variables are computationally very intensive. The correlated Gaussian variables equation is equivalent to a vector multiplication, requiring as many multiplications as vector elements and as many additions as vector elements minus one when one result per cycle is targeted. The importance of using factorization is easily observed, since factor F is in the order of tens (10-20) while N can reach hundreds, meaning that, just for the calculation of each N_i(t_k) per time step, factorization saves N − F multiplications and additions. Furthermore, it also reduces the required number of N(0,1) random variables per time step from

N to F, reducing the computational load of all the components of the Gaussian random number generator and the number of dimensions required for the Latin Hypercube (and the memory resources it needs).

With respect to the drift, µ_i(t_k), analyzing equation (5.4) we observe a summation whose number of operands depends on the index i of the drift and can be in the order of hundreds. In software this situation implies an important computational cost, but in hardware two other problems arise: how to handle an equation with a variable number of operands, and how that large number of operands can be efficiently managed on a hardware architecture. A first look at the alternative equation for the drift after factorization, equation (5.11), makes the problem even worse, as a second summation appears, now of F operands. However, when analyzed in more detail, it can be observed that in equation (5.11) the operands of the first summation do not depend on a variable with index i. This makes it possible to reuse the partial results among different drifts when the index i is covered by an appropriate scheduling (see the sketch below). Now the problem is reduced to the second summation, with a fixed number of operands, F, whose value is in the order of tens. In Section 5.4.3.2 the adopted solution is explained.
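The following sketch (Python with NumPy) illustrates the reuse of partial results for the drifts of the branch Q − 1 < i ≤ N − 1 of equation (5.11): for each factor h, a single running accumulator over j is updated once per LIBOR and shared by all subsequent drifts, so each drift only needs the final F-operand combination. The scheduling and the accumulator naming are illustrative, not the hardware datapath of Section 5.4.3.2.

import numpy as np

def drifts_with_reuse(L, sigma_k, tau, A, Q):
    """Drifts mu_i(t_k) for Q-1 < i <= N-1, reusing the inner sums of (5.11)."""
    N, F = A.shape
    w = tau * L / (1.0 + tau * L)            # tau_j L_j / (1 + tau_j L_j)
    acc = np.zeros(F)                        # running inner sums, one per factor h
    mu = np.zeros(N)
    for i in range(Q, N):
        # Extend every inner sum with the single new operand j = i
        acc += w[i] * sigma_k[i] * A[i, :]
        # Final F-operand combination for drift i
        mu[i] = sigma_k[i] * np.dot(A[i, :], acc)
    return mu

The branch 0 ≤ i < Q − 1 can be computed analogously by walking i downwards from Q − 2 and accumulating the operand j = i + 1 before each drift.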

5.2.3. Operators’ complexity

Most of the operations in the equations are additions and multiplications. In software, these operations are very efficient, as current CPUs contain dedicated silicon for them with floating-point arithmetic. However, we can find three more complex operations: division, square root and exponentiation, which have a major impact in software, as they are usually emulated through subroutines. As seen in Chapters 3 and 4, high performance implementations of all the operators can be achieved in FPGAs if deep pipelines are not a problem. Again, the adder and the multiplier are the most efficient operators in terms of resources and pipeline stages.

Of the three complex operators, the square root can be avoided in the FPGA architecture. It is only used to compute √ΔT, and this can be done just once during the calibration of the model and stored in the memory of the FPGA. However, the divisions and the exponentiation (in case a product valuation is included) are mandatory.

5.2.4. Model Summary

Summarizing the previous sections, a complete LMM is characterized by the following parameters:

• N: the number of LIBORs of the system.
• Q: the index of the time corresponding to the numeraire.
• S: the number of simulation times.

Table 5.1: Parameters Range.

             N           S            F         LH       P
Typical      100-200     hundreds     10-20     4-32     40000-200000
Maximum      ≈ 360       ≈ 1300       20        32       400000

• F: the reduction factor.
• LH: the number of stratas of the Latin Hypercube.
• P: the number of replications (paths) required.

whose typical and maximum values are summarized in Table 5.1, based on real market simulations. Also, the following initial data are required:

• [L(0)]: the LIBORs at the initial date t = 0.
• [τ]: the accrual factors.
• ρ_ij ([C], [Ã], [A]): the correlations between LIBORs and their associated correlation matrixes.

• [σ(t_k)]: the volatility matrix.
• [ΔT] and [√ΔT]: the sizes of the time steps.

And finally, the following operations will be necessary:

• Addition.
• Multiplication.
• Division.
• Exponentiation.

Hence, before developing an architecture it is necessary to determine the maximum values of these parameters: F, S, LH and N. In our case, and based on real market simulations, we have:

• N: 30 years of monthly forward rates, which imply 360 LIBORs.
• S: a 25 year simulation with weekly time steps, which sets a limit of 1303 time steps.
• F: 20 correlation factors.
• LH: 32 Latin Hypercube stratas.

5.2.5. Qualitative Profiling

The last part of the model analysis addresses whether this model can be accelerated with an FPGA and how.

From the equations of the model, the range of the parameters and the results obtained for the required Gaussian RNG in Chapter 2, it is easily deduced that the classic approach to Monte Carlo acceleration, the parallel computation of many independent replications, cannot be applied. The required Gaussian RNG consumes many resources, while the equations of the model imply summations of multiplications that should be implemented in parallel, thus requiring a high number of operators (determined by variable F) and therefore resources. Consequently, we have to focus on the acceleration of just one replication (maybe two replications in parallel). From a brief analysis of the model equations, acceleration can only be achieved by exploiting the parallelism of the model and building datapaths with many operators working at the same time, while a major concern has to be resolved or minimized: the impact of data dependencies, which can be seen as feedbacks.

5.2.5.1. Model Parallelism

Three degrees of parallelism can be exploited:

• Between the model core and the Gaussian RNG core. The RNG core has to feed the model with random numbers and no other dependency exists between both cores. The developed Gaussian RNG is capable of generating one Latin Hypercube while a previously generated one is being used, allowing both cores to work in parallel: the RNG works one time step in advance, preparing the Latin Hypercube for the next time step while the model core is using the random numbers of the current time step.
• Between variables of the LMM. The drift and the Brownian motion are independent, so their calculation can be done in parallel.
• Within the variables' equations. The three main equations of the model have operations that can be done in parallel. Especially relevant are the correlation of the factors of the Brownian motion and the drift term.

5.2.6. Data dependencies

In the model two data dependencies can be found:

• Drift calculation at a given time requires the values of the active LIBORs at that time.
• LIBOR calculation at a given time requires the LIBOR at the previous time.

In hardware, these dependencies can be seen as data feedbacks whose impact is related to the number of pipeline stages between the stage where the data are calculated and the stage where those data are needed.

Figure 5.3: LMM Monte Carlo Simulation with Latin Hypercube.

However, an additional data dependency has a strong impact on the performance of the accelerator. As mentioned in Section 5.2.2, the drift term implemented following equation (5.11) can skip the first summation by reusing partial results, which introduces a new data feedback. Thereby, data dependencies will have a strong impact on the control of the hardware accelerator and on the working clock frequency.

5.3. Adapting the model to Hardware

5.3.1. Simulation order

The use of Latin Hypercube introduces dependencies among the random variables of a group of paths whose size equals the number of stratas used. From now on we will refer to these related paths as a grouped path, and to each of its individual paths as a subpath. For all the time steps of the simulation, the random variables of a grouped path are calculated together, and thereby replications are not handled alone, see Figure 5.3, where each subpath is denoted as LH_i. Two options arise to handle this new situation:

• Independent simulation of each subpath. Before simulating the first subpath of a group, this type of simulation requires the generation of all the random variables for every time step and every subpath of the group.

• Joint simulation of a group of subpaths. Subpaths are simulated together, so for each time step all the LIBORs are computed for all the subpaths. Now, the generation of random variables can be done for each time step. Two new options are now possible:

– To compute all the LIBORs of a subpath before continuing with the next subpath.

– To compute the same LIBOR for all the subpaths before continuing with the next LIBOR.

The main implications of choosing between independent and joint simulation are the memory resources required and the design of the pipelined datapath. In the first scheme, F × LH ×

(S − 1) memory words are needed for the random variables, while N_act memory words are required for the feedback of LIBORs between two consecutive times. In the second case, we only need to store F × LH random variables, while now N_act × LH LIBORs are required for the feedback. As embedded memory resources are scarce in FPGAs, the second option is better, as it requires less memory. Furthermore, there is another important issue that should be taken into account: the impact that both schemes have on the pipelined datapath. As will be seen in Section 5.4.3, a deep pipeline is required to achieve high performance. The data dependencies of the model demand a minimum number of clock cycles between two time steps due to the feedback of LIBORs, equal to the number of feedback stages (Fb_stg from now on): the number of pipeline stages between the first stage where a feedback LIBOR is used (the drift) and the stage where it is calculated.

This circumstance implies that if there are fewer LIBORs to simulate per time step than Fb_stg, the hardware engine has to wait for the feedback LIBORs to be computed, harming the throughput. As with the memory, the second scheme is better, since between two time steps there are N_act × LH LIBORs to be computed, while in the first scheme there are only N_act, as the short calculation after this paragraph illustrates.
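The trade-off is easy to quantify with the typical parameters of Table 5.1. The snippet below compares the random-variable storage of both schemes and checks the stall condition of the joint scheme; the Fb_stg value used is only a placeholder, since the real figure depends on the pipeline of Section 5.4.3.

F, LH, S = 20, 32, 1300          # typical upper values from Table 5.1
N_act = 360                      # active LIBORs at the start of a path
Fb_stg = 40                      # placeholder pipeline depth (illustrative only)

words_independent = F * LH * (S - 1)   # random variables stored, first scheme
words_joint = F * LH                   # random variables stored, second scheme
print(words_independent, words_joint)  # 831360 vs 640 memory words

# Stall condition of the joint scheme: only when very few LIBORs remain active
print(N_act * LH < Fb_stg)             # False: no stalls with these values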

5.3.2. Tailored Arithmetic

Financial simulation demands highly accurate results, as any small deviation from the exact results can mean the loss of a large amount of money. Hence, almost all financial simulations rely on floating-point arithmetic, as it is the only arithmetic capable of ensuring the expected accuracy given the simulation requirements:

• Wide value range of the numbers involved in the simulation.
• High error propagation of intermediate results due to data and time dependencies.

Of the two most common floating-point precisions (single and double), financial simulation relies on double precision because, with its 53-bit mantissa resolution, the propagation error is smaller than with the 24-bit mantissa of single precision. However, in FPGAs this precision has a high cost in terms of processing time, resources and design complexity, requiring at least twice the resources of single precision and more pipeline stages, and achieving lower frequencies [Alta]. Furthermore, the accuracy required in the results of many applications can be achieved with single precision, or with a tailored precision which is much closer to single than to double precision [LLK+06]. Therefore, the ideal arithmetic to adopt has to offer an accuracy which ensures that its results can be considered equivalent to those obtained with double precision, but with a use of resources closer to that of single precision operators. In this way, a key factor for the engine architecture is the arithmetic used and its precision, as FPGAs work at bit level, allowing the use of any tailored arithmetic. Choosing the adequate arithmetic (fixed-point, floating-point, logarithmic, ...) with the adequate bit-width resolution is essential to obtain results with the expected accuracy while achieving the best performance (fewer FPGA resources and higher clock frequencies).

Hence, precision and area must be traded off while still obtaining highly accurate results. However, how do we decide which precision is enough for an application? A mathematical analysis of the maximum error introduced by reducing the precision leads to a huge error bound due to the combination of three features of the model: first, the data feedback of the previous LIBORs to the drift calculation and to the next LIBOR calculation; second, the huge number of operations per equation (summations); and third, the huge number of simulation times that can be involved per replication of the model. Additionally, any FPGA implementation will exploit the parallelism of the equations of the mathematical model. Due to the non-associativity of floating-point operations, the results obtained with parallel operations may differ from those obtained with an equivalent sequential implementation [KD07], making a mathematical analysis even more unfeasible. In our case we have decided to choose the precision following an experimental methodology: we set an accuracy criterion, a maximum difference allowed between the software obtained LIBORs and the hardware obtained ones, and then we search for the precision that fulfills the criterion (the sketch after this paragraph illustrates the idea in software). In Section 5.5.2 the criterion followed, how we measured it and the results obtained are exposed.
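As an illustration of this experimental methodology, the sketch below reruns the reference time step of Section 5.1 while rounding the LIBOR state to a reduced precision (NumPy's float32 stands in for the tailored hardware arithmetic) with the same random numbers, and reports the maximum relative deviation with respect to the double precision golden run. The lmm_time_step function is the reference sketch given earlier, and the comparison only emulates a reduced storage precision per step, not a bit-accurate hardware datapath.

import numpy as np

def max_relative_deviation(L0, sigma, tau, C, dt, Q, steps, seed=0):
    """Maximum relative difference between a double precision golden run and a
    run whose LIBOR state is rounded to float32 after every time step."""
    rng_hi = np.random.default_rng(seed)
    rng_lo = np.random.default_rng(seed)          # same random numbers in both runs
    L_hi = np.asarray(L0, dtype=np.float64)
    L_lo = np.asarray(L0, dtype=np.float64)
    worst = 0.0
    for k in range(steps):
        L_hi = lmm_time_step(L_hi, sigma[k], tau, C, dt[k], Q, rng_hi)
        L_lo = lmm_time_step(L_lo, sigma[k], tau, C, dt[k], Q, rng_lo)
        L_lo = L_lo.astype(np.float32).astype(np.float64)   # inject the precision loss
        worst = max(worst, float(np.max(np.abs(L_lo - L_hi) / np.abs(L_hi))))
    return worst   # to be compared against the chosen accuracy criterion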

5.4. FPGA Monte Carlo Libor Market Model Engine

5.4.1. General Architecture

As introduced in Section 5.1, the Monte Carlo LIBOR Market Model simulation requires a calibration stage to determine the volatilities of the model, σi(t), and the correlation factors (and their related matrices). This task, although complex and time consuming, is done only once and represents just a negligible part of the total simulation time. Thereby, it is more suitable to perform it in software. Additionally, calibration can be used to compute all the other variables that remain constant through all the different P paths, such as Δtk, √Δtk, τj and the values of the LIBORs at t0, and to signal which simulation times correspond to a fixing or a maturity date of any LIBOR. All these data, together with the volatilities σi(t) and the correlation matrix used, must be stored in the FPGA for the simulation, and this can be done using Block RAMs.

Figure 5.4: Engine Architecture.

Hence, the LMM engine will be composed of several memories storing the data which is constant through all the replications and of the units implementing the repetitive tasks of the mathematical model and the random number generation. Figure 5.4 shows the top view of the engine. The RNG core provides the random variables with the Gaussian distribution required by the LMM core and with the statistical properties defined by the variance reduction technique employed, the Latin Hypercube. Between the RNG core and the LMM core, where the LIBORs are calculated, an interface element is needed to adapt the way the RNG core provides random samples (one per clock cycle) to the way the LMM core consumes them (as a vector of F variables, see Section 5.4.3.1). On the right side of the figure we find the optional module, the product valuation, whose inputs are the computed LIBORs and their associated τi. These four elements are managed by a general control unit that also handles the synchronization (see Section 5.4.5), the storage of the data calculated during software calibration, and the storage of the data which parameterizes the simulation of the model: N, Q, S, F, LH and P.

5.4.1.1. Implementation strategy

The main goal of any hardware accelerator is to carry out a computational task in the minimum possible time, searching for the most efficient way to compute the task while doing it at the highest clock frequency. In a model without data dependencies this would imply the use of as many pipeline stages as possible to achieve a high clock rate. However, the design of an efficient hardware accelerator has to take into account the data dependencies of the model, if they exist, as is the case of the LIBOR Market Model.

As seen in Section 5.3.1, the feedback of the LIBORs calculated in the previous time step makes it necessary to stall the engine when the number of LIBORs per time step, Nact ∗ LH, is smaller than the number of pipeline stages, Fbstg, between the stage where a LIBOR is obtained and the stage where it is needed. The importance of these stalls for the throughput depends on the values of Fbstg and Nact ∗ LH, and in most cases, due to the high values of Nact ∗ LH, these stalls are never produced or their impact is negligible. Hence, the pipeline could be as deep as desired if this implies an improvement in the working frequency.

However, two critical elements of the system limit the maximum achievable frequency: the working frequency of the Latin Hypercube and the data dependency found in the drift. With respect to the Latin Hypercube, the copy of the permutation tables to the tables from which the strata are read limits the working frequency of the whole system, due to the size and capacity of the tables needed and the requirement of copying all permuted strata in just one cycle, see Section 2.5.3. On the other hand, due to the reuse of partial results in the drift calculation, Section 5.4.3.2, an accumulator is needed to replace the summation in equation (5.11), creating a data dependency between the output of the accumulator and one of its inputs. This situation implies that the floating-point addition of the accumulator must be carried out in just one cycle, preventing the pipelining of this operator.

Due to this last limitation, the acceleration will not come from achieving a high frequency. It will come from a pipelined datapath capable of calculating one LIBOR per cycle with a high number of operators and elements working in parallel. To achieve this datapath, the RNG core and the LMM core (plus the product valuation) must work in parallel: while the LMM core is computing the LIBORs for a time step, the RNG core is feeding the LMM core with the random variables needed for that time step and calculating the Latin Hypercube for the random samples of the next time step.
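As a back-of-the-envelope illustration of the stall condition just described (my own example numbers, not figures from the thesis), the stalls per time step caused by the LIBOR feedback can be estimated as:

```python
def feedback_stall_cycles(n_act, lh, f_bstg):
    """Stall cycles per time step caused by the LIBOR feedback.
    n_act  : LIBORs still active in this time step
    lh     : Latin Hypercube subpaths simulated together
    f_bstg : pipeline stages between where a LIBOR is produced and where it is fed back
    """
    return max(0, f_bstg - n_act * lh)

# Example: 40 active LIBORs and LH = 32 give 1280 LIBORs per time step,
# far above a feedback depth of, say, 60 stages, so no stalls are produced.
print(feedback_stall_cycles(40, 32, 60))   # -> 0
```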

5.4.1.2. Adaptable Architecture

In hardware, the parameterization due to N, Q, S, F, LH and P has a direct impact on the architecture of the engine itself:

1. F determines the number of elements of the summations of equations (5.10) and (5.11).

2. F , N and S determine the number of words needed to store the calibration data and the other data common to all paths.

3. LH and N determine the number of words needed to store the LIBORs for feedbacks.

4. LH and F determine the number of words of the Latin Hypercube tables and their configuration.

The first and fourth facts have the biggest impact on the architecture. The number of elements of the summations determined by F has a direct translation into the architecture of the engine when those summations are handled with partial additions in parallel, see Section 5.4.3.1. Additionally, the synchronization of the RNG core with the LMM core and the control signals also depend on F, as the LMM core requires the F Gaussian samples at the same time.

Meanwhile, the fourth fact determines the resources required by the Gaussian RNG while affecting its performance, as studied in Section 2.5.3. Hence, the values of the parameters affect the design of the architecture itself and have a direct impact on the resources needed for storage and for the Latin Hypercube. However, the hardware accelerator must be able to simulate any set of parameters. It is therefore necessary to develop an architecture whose datapath and components are flexible enough to adapt to the different sets of parameters. To meet this requirement, our approach has been to develop a maximum-bound architecture with a control mechanism: an architecture sized for a simulation whose parameters take their maximum values, in combination with the necessary logic and control mechanisms to adapt all the elements affected by the parameters when simulating smaller values. Hence, prior to developing the architecture it is necessary to determine the maximum values for those parameters: F, S, LH and N. Our architecture will thus be designed according to the maximum values summarized in Table 5.1.

5.4.2. Gaussian RNG Core

F random numbers with normal distribution, N(0,1)j(tk), j = 0, ..., F − 1, are required to compute the random component Ni(tk) of the model, see equation (5.10). The Gaussian RNG with Latin Hypercube developed in Chapter 2 will be used as the base RNG core, as it fulfills the main features required for the LMM accelerator. Four criteria to decide which generation method is better suited for a RNG and Monte Carlo acceleration were exposed in Section 2.3.2. In the case of the LIBOR Market Model engine, three of those criteria are essential.

First, the compatibility with variance reduction techniques is a must, as the computational load of the model, in combination with the huge number of paths needed to obtain the desired accuracy, makes the use of these techniques necessary to reduce the total number of paths and hence the total simulation time.

Second, the capability of generating one sample per cycle is essential to obtain the maximum throughput from the engine. The LMM core is able to compute one LIBOR per cycle, see Section 5.4.3, and if there were no factorization it would require one Gaussian sample per cycle. When factorization is applied, only F Gaussian variables are needed for each time step of each subpath, so it could be thought that one random sample per cycle is not necessary. However, as a simulation reaches the final time steps, the number of remaining LIBORs to be computed, Nact, can be equal to or smaller than F. In this case the LMM engine would have to be stalled until the required random numbers from the RNG core are computed.

Third, the high accuracy demanded in financial simulations requires high-quality random samples, and floating-point arithmetic is the most suitable to provide this quality while developing the LMM core. The RNG has to provide floating-point samples, as the LMM core is implemented using this arithmetic due to the accuracy demanded and the value range handled. With respect to the last criterion considered, a high clock rate or the possibility of pipelining to achieve it, it will not be essential in the case of the LMM, as the LMM core will be designed to compute one LIBOR per cycle. Due to the way the drift is computed, an accumulator is needed and it will limit the working clock frequency of the whole system, see Section 5.4.3.2.

5.4.2.1. Random sequences reproducibility

When developing a RNG for a financial application, or any other application, a practical feature has to be considered: the random samples must be generated on demand, and the core should not be working when random samples are not needed. There are two reasons for this, the reproducibility of results and low power consumption. The outcome of any application should be reproducible to ensure that the obtained results are correct and, in particular, as we are dealing with a hardware accelerator, to ensure that the same results are obtained in hardware and software and to ease the debugging of the hardware. Thereby, random number generation must be fully synchronized between hardware and software, using exactly the same base random sequences (the uniform sequence for the Gaussian generation and the uniform sequence for the random index of the Latin Hypercube permutation). This implies providing the logic necessary to stall the generator. The alternative of letting the base random generator run all the time, discarding the random numbers generated while they are not needed, would make the hardware and software results differ.

When using factorization and F is smaller than Nact, fewer random samples are needed than LIBORs to compute, so a free-running hardware base random number generator would continue working for Nact − F cycles, generating Nact − F samples which are not used and breaking the synchronization with software, hence requiring stalls. With respect to low-power issues, stalling the RNG while random numbers are not needed avoids the dynamic power consumption associated with the stalled operations, decreasing the total power consumed by the system.

5.4.3. LMM Core

The LMM core is basically a datapath in charge of calculating equations (5.3), (5.10) and (5.11). It is based on floating-point operators and relies on storage elements (memories and registers) for the model inputs, the calibration data and the feedback data, while its control signals are provided by an external control unit.

Figure 5.5: LMM Core unit.

Figure 5.5 shows the general architecture of the core with its main components, the calculation units and the main storage elements. Five storage elements are required for the initial and calibration data, while another storage element is needed to store the calculated LIBORs, as they will be used in the next time step.

5.4.3.1. Correlation: Random Variables and Drift

The correlation of the Gaussian variables, equation (5.10), corresponds to the basic linear algebra subroutine of vector multiplication. The same subroutine can be found in the drift equation (5.11) when the second summation is reduced to a factor depending on the index of the first summation.

In both equations, for each LIBOR to compute, F factors, from now on Zh, are correlated according to the corresponding row of the reduced correlation matrix [A]:

\sum_{h=0}^{F-1} a_{i,h}\, Z_h(t_k) \qquad (5.12)

with each calculated variable requiring F multiplications and F − 1 additions. Two possible architectures can handle this equation while obtaining one variable per clock cycle: a parallel architecture and a sequential one.

The first option, the parallel vector multiplication (Figure 5.6), implies the minimum logic depth of the architecture, 1 + log2(F) levels of operators. The F multiplications are carried out in parallel and a tree of adders reduces the multipliers' results in pairs. In this case, the F Zh factors are required in the same cycle.

The second option, the serial vector multiplication (Figure 5.7), implies a first level composed of two multiplications, while the rest of the levels imply one addition with one multiplication, and a final level

Figure 5.6: Parallel Correlation.

Figure 5.7: Sequential Correlation.

with one addition, requiring F levels of operators. Now the F Zh factors are not required in the same cycle; only the first two are required together. We have chosen the first option for the LMM engine due to several advantages:

• The lower logic depth, which implies fewer pipeline stages in the global datapath. The correlation of the drift is in the feedback datapath, so its architecture has a direct impact on the value of Fbstg, which is responsible for introducing stalls.

• The F Zh factors of the drift are calculated in parallel, see next section. Meanwhile, the interface between the GRNG core and the LMM core control is in charge of providing the F Gaussian variables in the same cycle, simplifying the design of the LMM core.

• Control and logic simplification. This option allows homogeneous access to the rows of the correlation matrix, reading a complete row with the same control signal. Meanwhile, the synchronization required between signals in the adder tree is minimal.

• The sequential implementation requires extra logic and a more complicated control to correctly synchronize all the signals (for example, registering the F Zh factors of the drift).

In contrast with these advantages, only one possible drawback of choosing the parallel option with respect to the sequential one can be found: it cannot be ensured that the results obtained are exactly the same as in software. Due to the non-associativity of floating-point operations, the results obtained with parallel operations may differ from the ones obtained with an equivalent sequential implementation. However, due to the use of a tailored precision instead of the standard precision used in software, the results may differ from software even if the sequential option is used, so there is no advantage for the hardware sequential implementation.
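A quick numerical illustration of this non-associativity (my own sketch in single precision with NumPy, not part of the thesis): a pairwise adder-tree reduction and a sequential left-to-right sum of the same products can already differ in the last bits.

```python
import numpy as np

rng = np.random.default_rng(0)
F = 20
a = rng.standard_normal(F).astype(np.float32)   # one row of the reduced correlation matrix
z = rng.standard_normal(F).astype(np.float32)   # the F factors Z_h(t_k)
prod = a * z                                    # the F parallel multiplications

def tree_sum(x):
    """Pairwise reduction, mimicking the adder tree of the parallel correlation."""
    x = [np.float32(v) for v in x]
    while len(x) > 1:
        x = [np.float32(x[i] + x[i + 1]) if i + 1 < len(x) else x[i]
             for i in range(0, len(x), 2)]
    return x[0]

seq = np.float32(0)
for p in prod:                                  # sequential (serial) correlation
    seq = np.float32(seq + p)

print(tree_sum(prod), seq)                      # often equal, but may differ by a few ulps
```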

5.4.3.1. Adaptation to smaller F

The architecture depends on the maximum value of F, as it determines the number of elements of the correlation and consequently the depth of the adder tree. The developed architecture is valid for the maximum value of F, but it is oversized for a smaller F that would not require all the operators. To adapt it to a smaller F, two options arise:

• The use of multiplexers and bypasses to have all the sub-architectures for smaller F in the same architecture. An output multiplexer finally selects the correct result from the different paths of the sub-architectures.

• Forcing the operators that are not used to output zero. For a smaller F, all the operators that are not required lead to additions with zero, so the result for the smaller F is unchanged by the computations of the unneeded operators.

We have opted for the second option as it has a considerably lower impact on the needed resources, thanks to the use of operators with hardware flags: changing the flags of the unused ai,h coefficients to the flags corresponding to a zero number is enough to force zero outputs in all the operators that are not required, as illustrated in the sketch below.
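A behavioural sketch of this zero-forcing idea (illustrative only; the real core forces zeros through the hardware flags of the operators rather than by padding data in software):

```python
import numpy as np

def tree_sum(x):
    """Pairwise adder-tree reduction, as in the parallel correlation."""
    x = [np.float32(v) for v in x]
    while len(x) > 1:
        x = [np.float32(x[i] + x[i + 1]) if i + 1 < len(x) else x[i]
             for i in range(0, len(x), 2)]
    return x[0]

F_MAX, F = 20, 7                           # tree sized for F_MAX, simulation uses F
prod = np.random.default_rng(1).standard_normal(F).astype(np.float32)

padded = np.zeros(F_MAX, dtype=np.float32)
padded[:F] = prod                          # unused operators see zero inputs

# Floating-point additions with +0.0 are exact, so the oversized tree
# reproduces the F-term result unchanged.
assert tree_sum(padded) == tree_sum(prod)
```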

5.4.3.2. Drift Calculation

The drift calculation is the key element of the LMM core. As mentioned in different sections, its complexity and how it is handled have a deep impact on the performance of the whole system (the restriction of completing the accumulator operation in just one cycle) and on the control of the engine (the scheduling of indexes to skip the second summations). Additionally, the drift calculation largely determines the depth of the LMM engine's datapath.

The drift equation has three terms, depending on the index of the corresponding LIBOR with respect to the numeraire index Q. It is zero for the LIBOR previous to the one corresponding to the numeraire, while the other two terms differ in the indexes of the second summation and in the sign.

\mu_i(t_k) =
\begin{cases}
-\sigma_i(t_k) \sum_{h=0}^{F-1} a_{ih} \sum_{j=i+1}^{Q-1} \dfrac{\tau_j L_j(t_k)}{1+\tau_j L_j(t_k)}\, \sigma_j(t_k)\, a_{jh}, & 0 \le i < Q-1 \\[6pt]
0, & i = Q-1 \\[6pt]
\sigma_i(t_k) \sum_{h=0}^{F-1} a_{ih} \sum_{j=Q}^{i} \dfrac{\tau_j L_j(t_k)}{1+\tau_j L_j(t_k)}\, \sigma_j(t_k)\, a_{jh}, & Q-1 < i \le N-1
\end{cases}
\qquad (5.13)

As just seen in the correlation section, once these second summations, from now on Di,h(tk), are obtained, the drift equation reduces to a vector multiplication and a multiplication by the volatility σi(tk). For the sake of simplicity, the following equations will not take into account this σi(tk) multiplication nor the different signs of the drift terms. In this way we define:

\bar{\mu}_i(t_k) = \pm \frac{\mu_i(t_k)}{\sigma_i(t_k)} \qquad (5.14)

so we can rewrite (5.13) as:

\bar{\mu}_i(t_k) = \sum_{h=0}^{F-1} a_{i,h}\, D_{i,h}(t_k) \qquad (5.15)

Apart from the operators, the complexity of the drift calculation comes from those second summations (F for each LIBOR):

 ∑  Q−1 τj Lj (tk)  σj(tk)ajh, 0 ≤ i < Q − 1, 0 ≤ h < F  j=i+1 1+τj Lj (tk) Di,h(tk) = 0, i = Q − 1 (5.16)  ∑  i τj Lj (tk) σj(tk)ajh,Q − 1 < i ≤ N − 1, 0 ≤ h < F j=Q 1+τj Lj (tk)

The complexity of these equations is due to two facts:

1. The number of components of the summations is variable and related to the value of i.

2. The dependency of the summations on index h.

With respect to the different number of components of the summations, a direct implementation would imply an architecture (or a control) capable of handling a variable number of operators. However, as introduced in previous sections, these summations can be replaced by accumulators if an appropriate scheduling of the LIBORs in the simulation is carried out. For the two non-zero terms of the drift equation, the number of components of the summations is incremental and starts with just one component (for i = Q − 2 and i = Q), so no summations are needed in these cases. For example, for i = Q we have:

D_{Q,h}(t_k) = \frac{\tau_Q L_Q(t_k)}{1 + \tau_Q L_Q(t_k)}\, \sigma_Q(t_k)\, a_{Qh}, \quad 0 \le h < F \qquad (5.17)

If after i = Q we continue simulating with i = Q + 1, we will have summations with two components, where the first one corresponds to the summation for i = Q.

D_{Q+1,h}(t_k) = D_{Q,h}(t_k) + \frac{\tau_{Q+1} L_{Q+1}(t_k)}{1 + \tau_{Q+1} L_{Q+1}(t_k)}\, \sigma_{Q+1}(t_k)\, a_{Q+1,h}, \quad 0 \le h < F \qquad (5.18)

The same happens with the following drifts:

D_{Q+2,h}(t_k) = D_{Q+1,h}(t_k) + \frac{\tau_{Q+2} L_{Q+2}(t_k)}{1 + \tau_{Q+2} L_{Q+2}(t_k)}\, \sigma_{Q+2}(t_k)\, a_{Q+2,h}, \quad 0 \le h < F \qquad (5.19)

as long as we continue simulating, incrementing the index by one. For the other term of the drift the same circumstance happens, but now starting at i = Q − 2 and decreasing the index i by one. These equations correspond to accumulators, as we are adding a new result to the previous one, with the summations computed in different steps.

With respect to the other fact, the dependency of the summations on the index h, for each LIBOR we have F components to compute. However, the index h only affects one element of Di,h(tk), the ajh. As this element multiplies all the other elements of Di,h(tk), we can rewrite Di,h(tk) in the following way:

D_{i,h}(t_k) = \sum d_i(t_k)\, a_{jh} \qquad (5.20)

d_i(t_k) = \frac{\tau_j L_j(t_k)}{1 + \tau_j L_j(t_k)}\, \sigma_j(t_k), \quad j = i+1 \text{ or } j = i \qquad (5.21)

Considering the transformation of the summation Di,h(tk) and this common factor di(tk), the main components of the drift equation (the two summations) can be expressed in a general way as:

\bar{\mu}_i(t_k) = \sum_{h=0}^{F-1} a_{ih} \left( d_i(t_k)\, a_{jh} + D_{i \pm 1,h}(t_k) \right), \quad j = i+1 \text{ or } j = i \qquad (5.22)

Summarizing, three elements can be found in this equation. First, the component common to all h, di(tk), and its multiplication by the F correlation factors ajh. Second, the accumulations substituting the second summation, i.e., the additions with Di±1,h(tk). And third, the correlation of the drift components.

Figure 5.8: Drift Calculation.

Figure 5.8 depicts the architecture designed following this equation. The computation of the drift starts with the calculation of the common factor. It is then multiplied in parallel by all the ajh elements to compute in parallel the F components for the correlation. Afterwards, the accumulators replace the summations, and finally the correlation is carried out as defined in the previous section. In addition to the operators in the figure, some extra logic is necessary to handle the reset of the accumulators when we have to swap between one term of the drift equation and another, and to set the correct sign of the result.
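The accumulator-based evaluation of equations (5.17)-(5.19) can be modelled behaviourally as follows (an illustrative Python sketch of my own; the sign handling and the final σi(tk) multiplication are omitted, as in the equations above, and only the third drift term is shown):

```python
import numpy as np

def drift_components_third_term(L, tau, sigma, A, Q):
    """Incremental computation of D_{i,h}(t_k) for i >= Q, mimicking the hardware
    accumulators: each step adds one new product to the previous value instead of
    re-evaluating the whole summation.
    L, tau, sigma : length-N arrays with L_j(t_k), tau_j and sigma_j(t_k)
    A             : reduced correlation matrix of shape (N, F)
    Q             : numeraire index
    """
    N, F = A.shape
    D = np.zeros(F)                       # one accumulator per h
    for i in range(Q, N):                 # schedule: i = Q, Q+1, ..., N-1
        j = i                             # the third term uses j = i
        d = tau[j] * L[j] / (1.0 + tau[j] * L[j]) * sigma[j]   # common factor d_i(t_k)
        D = D + d * A[j, :]               # accumulate the new term for all F values of h
        yield i, D.copy()                 # D_{i,h}(t_k) for this LIBOR
```

Each yielded vector is then correlated with row i of [A] (the adder tree of the previous section) and multiplied by σi(tk) to recover µi(tk).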

5.4.3.3. LIBORs Calculation

The final computation of the LIBORs, equation (5.3), does not present any difficulty and its implementation corresponds to a direct translation of the equation, parallelizing the operations when possible. Only one modification is introduced.

In the drift equation, the last operation is a multiplication by σi(tk). In equation (5.3) it can be seen that the random component is also multiplied by σi(tk), making it a common factor. Hence, equation (5.3) can be rewritten as:

L_i(t_{k+1}) = L_i(t_k) + \sigma_i(t_k) \left( \bar{\mu}_i(t_k)\, \Delta t_k + N_i(t_k) \sqrt{\Delta t_k} \right) \qquad (5.23)

The proposed architecture is depicted in Figure 5.9. In a first stage, the drift and the random component are multiplied by their respective time factors. Second, the results of the multiplications are added so they can be multiplied by their common factor, σi(tk). Finally, the LIBOR obtained in the previous time step is added, obtaining the LIBOR of the next time step.
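For reference, a direct behavioural translation of equation (5.23) (the numbers in the example are made up):

```python
import math

def libor_step(L_i, mu_bar_i, sigma_i, N_i, dt):
    """One LIBOR update, equation (5.23): the volatility multiplies the sum of the
    drift and random contributions, and the previous LIBOR is added last."""
    return L_i + sigma_i * (mu_bar_i * dt + N_i * math.sqrt(dt))

print(libor_step(0.03, 0.0005, 0.15, 0.7, 0.25))   # e.g. a 3% rate, one Gaussian draw
```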

Figure 5.9: LIBOR Calculation.

Figure 5.10: LMM datapath.

5.4.3.4. Complete datapath

The complete LMM engine datapath is depicted in Figure 5.10. The calculations start with the common part of all the summations of the drift. After the accumulators of the drift, the correlation of the Gaussian variables and of the drift elements are carried out in parallel, obtaining the random component of one LIBOR and its drift in the same cycle. Finally, the LIBOR for the next time step is computed. Thanks to the accumulators, and provided that they obtain their addition result in just one cycle, the only remaining dependency is the feedback of the LIBORs for the next time step. Hence, while no stall is produced, the datapath is capable of working at full capacity, obtaining one computed LIBOR per cycle, as all the operators work in parallel.

Figure 5.11: Product Valuation Core.

The datapath, for the maximum value of F (20), is composed of 65 multipliers, 61 adders and one divider, i.e., 127 operations in parallel.

5.4.4. Product Valuation Core

The product valuation implementation for the selected example, a cap, can be seen in Figure 5.11. On the bottom left of the figure we find the numeraire valuation. A data dependency analogous to that of the drift can be found, but now with multiplications instead of additions.

N(T_i) = \prod_{j=i}^{Q-1} \frac{1}{1 + \tau_j L_j(T_i)} \qquad (5.24)

The impact of this dependency is the same: the multiplication operator cannot be internally pipelined and must complete in just one clock cycle. On the top left, we have the comparison between the LIBOR and the strike of the product, which is afterwards divided by the numeraire. These computations are only carried out for the LIBOR whose fixing date corresponds to Ti.

P_i = \frac{\max(L_i(T_i) - K,\, 0)}{N(T_i)}, \quad i = 1, \ldots, R \qquad (5.25)

Finally, the payoff obtained is accumulated with all the previous ones to obtain the net present value.

NPV = \sum_{i=1}^{R} \left( \tau_i P_i \right)^{\alpha} \qquad (5.26)

where the value of α is 1 or very close to one.
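The cap valuation of equations (5.24)-(5.26) can be sketched in software as follows (illustrative only; the matrix layout of the simulated LIBORs and the flat α = 1 default are my assumptions, not the thesis'):

```python
import numpy as np

def cap_npv(L, tau, K, fixing_idx, Q, alpha=1.0):
    """Cap payoff for one simulated path.
    L          : matrix with L[i, j] = L_j(T_i), i.e. LIBOR j observed at fixing date T_i
    tau        : accrual factors tau_j
    K          : strike of the cap
    fixing_idx : indexes i = 1..R of the caplets to value
    Q          : numeraire index
    """
    npv = 0.0
    for i in fixing_idx:
        # Numeraire N(T_i): product over j = i .. Q-1, equation (5.24)
        numeraire = np.prod(1.0 / (1.0 + tau[i:Q] * L[i, i:Q]))
        payoff = max(L[i, i] - K, 0.0) / numeraire       # equation (5.25)
        npv += (tau[i] * payoff) ** alpha                # equation (5.26)
    return npv
```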

5.4.5. Control Unit

The control core is in charge of generating all the control signals and the indexes to read from or write to the different storage elements of both the RNG core and the LMM core. Additionally, it controls the mechanisms to adapt the architecture to the different values of F, and it is in charge of synchronizing all the elements. One key element of the synchronization is the stalling of the RNG core and the LMM core. These stalls are not only due to the architecture itself and the data feedbacks previously discussed: another source of stalls in an LMM simulation is the existence of different scenarios.

5.4.5.1. Simulation Scenarios and Stalls

The control must handle two different scenarios. At the beginning, t0, all the LIBORs remain active and F will always be smaller than Nact. But as the simulation advances in time, Nact starts decreasing and eventually can become smaller than F. Hence, two scenarios are defined:

• When F < Nact.

• When F > Nact.

F = Nact is the borderline between the two scenarios and implies no special treatment. Each scenario will require its own stalls, affecting different units. The origin of these stalls is the difference between F and Nact, as for the simulation of each time step F ∗ LH random variables are needed while Nact ∗ LH data are computed. Not all the stalls have the same importance, as they affect different units. On the one hand, in the F < Nact scenario, the RNG has to be stalled to synchronize the generation of the random samples in the RNG core with their use in the LMM unit. These stalls affect just the RNG, not harming the global throughput, and their importance is related to result reproducibility and hardware-software synchronization, as previously introduced.

On the other hand, if due to the combination of parameters the simulation enters the F > Nact scenario, the LMM core needs to be stalled (F − Nact) ∗ LH cycles to allow the RNG to generate all the required samples. In this way, the throughput of the system is harmed, as in this scenario we cannot calculate one LIBOR per clock cycle. This type of stall, together with the architectural ones, are the most important stalls and limit the objective of calculating one result per cycle.

5.4.5.2. Control design

The control has been designed following three main guidelines:

• It is oriented to obtain the highest possible performance, avoiding any stalls that are not strictly necessary, i.e., anything beyond the architectural stalls and the ones due to the simulation scenarios.

• To simplify the control, all the control signals are generated referring to the same stage. Then, to synchronize the signals with the pipelined architecture, buffers of registers are used to delay the control signals by the necessary number of stages.

• Independent and synchronized sub-controls, one for the RNG core and another for the LMM plus the product valuation. In this way, the design of each part is simplified, while the debugging and testing of the hardware is made easier. Only one dependency cannot be avoided, the initial synchronization, as the LMM core cannot start computing until the RNG core is ready and has calculated the first Latin Hypercube.

With this scheme the control unit is composed of four elements: the RNG and LMM sub-control units, a bank of registers to store the data which parameterize the simulation of the model (N, Q, S, F, LH and P), and a memory element that stores the information about the simulation times as flags ('1' for simulation times corresponding to fixing or maturity dates, '0' for the others).

5.4.5.2. RNG control unit

Its main task is to handle the synchronization with the LMM control unit and to control and synchronize three elements of the RNG:

• The Mersenne Twister URNG.

• The Variance reduction tables and their associated Tausworthe URNG.

• The interface element between the RNG core and the LMM core.

With respect to the first two elements, the RNG control unit integrates the control required by the Latin Hypercube with the stall control required for both elements. The control for both elements also handles the start synchronization. As exposed in Section 2.5.1.2, the Mersenne Twister URNG requires an initialization of 624 cycles. Meanwhile, the Latin Hypercube needs F ∗ LH cycles to generate the strata of the first time step and another cycle before the strata are available from the read tables. With respect to the interface element, it is a bank of registers that delays the generated Gaussian samples until the last one of the group of F is generated and the F Gaussian samples can be fed together to the LMM core. The synchronization with the LMM control unit is carried out through the start signal, with an initial latency that delays the start of the LMM core until the RNG core is prepared: it has generated the Latin Hypercube for the random samples of the first time step and the random numbers for the first subpath of the first time step.

5.4.5.2. LMM control unit

The LMM control unit has two main tasks, the synchronization of the data in the LMM datapath and the control of all the logic associated with the drift terms, both tasks being related. As seen in Section 5.4.3.2, a specific scheduling of the order of the LIBORs is necessary to make it possible to compute one LIBOR per cycle. For LIBORs corresponding to the third term (LIBORs with index i > Q − 1) it is necessary to start with i = Q and then increase i by one until reaching N. For LIBORs of the first term (LIBORs with index i < Q − 1) the starting index has to be i = Q − 2, decreasing i by one until reaching the first active LIBOR. Following these requirements, the control unit has been designed to simulate each time step starting with the LIBORs of the third term, continuing with the LIBOR corresponding to index i = Q − 1, and ending with the LIBORs of the first term. The reason for this order is that it simplifies the control and the architecture: simulating i = Q − 1 before i = Q − 2 resolves in a natural way the problem of the different indexes between the third term (j = i, the LIBOR used in the drift is the same as the one being calculated) and the first term (j = i + 1, the LIBOR used in the drift is the one previous to the one being calculated), requiring only a delay of the input LIBORs for the drift in the first term. Following this order, the indexes for the storage elements are generated, together with the control signals needed to control the drift (a sketch of the resulting index schedule is shown after the list below):

• Reset the accumulators.
• Working with the first, second or third term of the drift equation.
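A sketch of the index schedule implied by this ordering (illustrative Python; the real control unit generates these indexes with counters and comparators):

```python
def libor_schedule(Q, N, first_active=0):
    """Order in which LIBORs are simulated within one time step: third term
    (i = Q ... N-1, increasing), then i = Q-1, then first term
    (i = Q-2 ... first_active, decreasing)."""
    order = list(range(Q, N))                           # third term
    order.append(Q - 1)                                 # zero-drift LIBOR
    order += list(range(Q - 2, first_active - 1, -1))   # first term
    return order

print(libor_schedule(Q=5, N=10))   # -> [5, 6, 7, 8, 9, 4, 3, 2, 1, 0]
```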

5.5. LMM Engine Implementation

5.5.1. Operators’ Features

To implement the architecture of the LMM core, Section 5.4.3, we have taken as a base the single-precision operators of the final hardware library, Section 3.6.3, which are characterized by the following features:

• Handling denormalized numbers as zeros.

• Rounding is always done to nearest.

• Hardware flags are used to indicate the type of number.

Meanwhile, from the two final libraries we have selected the one without the one-bit exponent extension, as the variables of the model take normalized values.

The operators from these libraries have to be modified in three ways:

• Mantissa bit-width extension: the precision of the mantissa of the base operators must be extended to achieve the high accuracy demanded by any financial simulation, see Section 5.3.2.

• Modification of the pipeline stages: the datapath has the architectural restriction of the one-cycle drift accumulators. Consequently, we are forced to use single-stage adders, which determine the maximum working frequency. This fact, in combination with the data feedback that makes it desirable to have as few pipeline stages as possible while the clock frequency is not harmed, implies the modification of the pipeline stages of all the other operators.

• Stall option: stall logic must be introduced in the operators due to the stalls required by the model and those required for communications.

5.5.2. LMM Core. Precision-Accuracy-Performance

The first task is to define an accuracy criterion to determine by how many bits the precision of the operators should be extended. The criterion we have followed is based on two facts. The first one is that software implementations of scientific and financial simulations rely on double-precision arithmetic to ensure the accuracy of the results but, as exposed in [LLK+06], for most applications the required accuracy can be achieved with a tailored precision much closer to single than to double precision. The second fact is the magnitude of the errors related to single precision. If we focus on the relative error, we can see that, in the most accurate scenario, the maximum relative error associated with a single-precision number is 0.5 ulp (the maximum relative error for an operator implementing round to nearest), a value that can be considered negligible:

\varepsilon = \frac{\Delta Z}{Z} = \frac{\hat{Z} - Z}{Z} = \frac{\left(2^0 + 0.5 \cdot 2^{-23}\right) - 2^0}{2^0} = 2^{-24} = 5.6\mathrm{e}{-6}\,\% \qquad (5.27)

affecting only the eighth digit of a decimal number. Based on these two facts, we have set the following accuracy criterion: the maximum difference between a LIBOR calculated in the FPGA core using tailored arithmetic and the same LIBOR computed in software with double-precision arithmetic must not be bigger than one single-precision ulp when both LIBORs are converted to single precision using round to nearest. Since both numbers are rounded to single precision from their previous precision, this implies in the worst case a maximum difference of two ulps between the SW and FPGA results, meaning a maximum relative error of 2.3e−5%. Once this criterion is set, the task of finding the precisions that fulfill it has to be carried out experimentally, replicating the same simulation in hardware and software and comparing the results from both.

Figure 5.12: Architecture for LMM accuracy measurement.

This implies the use of exactly the same Gaussian samples, as they are an input to the LMM. However, the Gaussian samples generated in hardware also differ from the ones generated in software: firstly, because of the same circumstance of comparing the double-precision samples (software) with the single-precision samples of the GRNG; and secondly, because the GRNG is based on the interpolation of a function with its own accuracy criteria. To solve this problem, the GRNG is replaced by a ROM memory containing all the Gaussian samples of one subpath, as seen in Figure 5.12, in the same way as all the other calibration data required for a simulation. Finally, the precision determined must provide enough accuracy for any set of parameters. Hence, the criterion must be fulfilled in the worst scenario. Due to error propagation, the more operations are needed to obtain a LIBOR, the worse the accuracy obtained. We can thus consider that the maximum error will be reached in the last time step of a path of a simulation whose parameters affecting the number of operations per LIBOR are close or equal to their maximum:

• N = 360

• F = 20

• S = 1303

and therefore we will determine the precision needed using a simulation with those parameters.
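The accuracy check itself can be expressed as a small sketch (my own illustration; ulp distances are measured on the float32 bit patterns, which is valid for positive finite values such as the LIBORs):

```python
import numpy as np

def single_ulp_distance(sw_double, hw_tailored):
    """Distance in single-precision ulps between the double-precision software LIBOR
    and the tailored-precision hardware LIBOR, both rounded to float32 first."""
    a = np.float32(sw_double).view(np.int32)
    b = np.float32(hw_tailored).view(np.int32)
    return abs(int(a) - int(b))

def criterion_fulfilled(sw_libors, hw_libors, max_ulps=1):
    return all(single_ulp_distance(s, h) <= max_ulps
               for s, h in zip(sw_libors, hw_libors))

one_ulp_up = float(np.nextafter(np.float32(1.0), np.float32(2.0)))
print(single_ulp_distance(1.0, one_ulp_up))   # -> 1, exactly at the criterion boundary
```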

5.5.2.1. Precision-Accuracy

To set the appropriate precision we have started from single precision and progressively increased the mantissa bit-width.

Figure 5.13: SW-HW difference in average

5.5.2.1. Small Simulation

Prior to testing the simulations with the maximum parameter values, a small test has been carried out to get a first idea of the impact of the precision on the accuracy of the results of the LMM model. The parameters of this test have been:

• S = 35 → 34 time steps
• N = 10
• F = 7
• Q = 8

Meanwhile, the number of Latin Hypercubes is irrelevant, as subpaths do not share operations. Figures 5.13 and 5.14 graphically depict the results obtained simulating one path with the LMM core and different mantissa precisions. The tag 24b corresponds to single floating-point precision (23 mantissa bits plus the hidden bit, 24 bits), while the others are extensions of single precision with up to four more mantissa bits (28b). As can be seen, the accuracy is strongly compromised with single precision as the time steps evolve, reaching up to eight ulps of difference. The errors are introduced due to the smaller resolution of the operators with respect to double-precision software. In the following time steps these errors propagate, while each operation may introduce new errors for the same reason. In this way, the general trend, see Figure 5.13, is that with each new time step the divergence between the hardware LIBORs and the software ones increases, although in some time steps the divergence is reduced as the newly introduced errors neutralize the previous ones.

Figure 5.14: Maximum SW-HW difference

As the precision is extended the divergence decreases. On the one hand, a larger precision implies a smaller maximum error introduced per operation. On the other hand, the inputs of the model are fed with more precision bits and hence with a lower error.

5.5.2.1. Maximum Simulations

To explore which precision fulfills the defined accuracy criterion, we have started with a simulation close to the maximum parameters:

• S = 1254 → 1253 time steps

• N=363

• F =20

• Q=243

Now we have used as the base precision the previous 28b, single precision with four extra mantissa bits. Figure 5.15 depicts the results obtained in the last time step, where the biggest errors are expected, for the first LIBORs remaining active. As can be seen, the precision that was previously enough for a small simulation is now far from obtaining accurate results, reaching up to 14 ulps of difference between the software LIBORs and the 28b hardware ones. However, if the precision is extended by another four mantissa bits the defined criterion is fulfilled again, as can be seen in the figure for the plotted LIBORs.

Figure 5.15: SW-HW difference in the last time step

Table 5.2: Implementation Results for the Modified Operators.

          |         +/-          |          *           |          /           |         x^y
          | V4 32b V5 32b V5 40b | V4 32b V5 32b V5 40b | V4 32b V5 32b V5 40b | V4 32b V5 32b V5 40b
Slices    |  412    256    373   |   94     39    168   |  487    261    410   | 1098    660    749
DSP       |   -      -      -    |    4      2      2   |   -      -      -    |   11      6      8
BRAM      |   -      -      -    |    -      -      -   |   -      -      -    |   13     11     21
Stages    |   1      1      1    |    1      1      1   |   5      5      6    |    7      6      8
MHz       | 59.4   84.5   59.0   | 71.7  101.4   78.5   | 60.7   89.2   66.3   | 59.6   61.9   62.7

5.5.2.2. Precision-Resources-Performance

The increase in operator precision has a direct impact on the resources of the operators and on their performance. As the precision is increased, the needed resources also increase: the operators have to handle bigger operands, all the logic related to the mantissas has to work with more bits, and the storage elements and registers work with wider data. Additionally, operators relying on digit-recurrence methods require more steps to compute the extra precision bits, further increasing their resource usage. Regarding the performance of the operators, the tasks related to the mantissas imply longer and slower logic paths as the mantissa size increases, harming the performance. With respect to digit-recurrence methods, new calculation steps imply new pipeline stages to avoid performance penalties. Table 5.2 summarizes these circumstances with the post place and route implementation results of the operators once they are modified according to the features needed for the LMM core, Section 5.5.1. Two different FPGAs have been used, the base Xilinx Virtex-4 XC4VFX140-11 and a Xilinx Virtex-5 XC5VFX200T-2. This second FPGA has been necessary as the huge number of operators of the LMM core exceeds the resources provided by the base Virtex-4, forcing the upgrade to a bigger FPGA. Single-precision operators are denoted as 32b and the operators with the 8-bit extended mantissa are denoted as 40b.

Table 5.3: Implementation Results of the LMM Engine for V5-FX200.

                   Slices          DSPs        BRAM        Stages   MHz
RNG                 4703 (15%)      13 (13%)     7 (1%)     22      85.2
LMM                22783 (74%)     130 (33%)   124 (27%)    21      50.1
LMM + Product      23732 (77%)     157 (40%)   133 (29%)    46      50.1
Engine             25291 (85%)     144 (37%)   134 (29%)    43      50.1
Engine + Product   27242 (88%)     171 (44%)   143 (31%)    58      50.1

In addition to the impact of the mantissa extension (compare the V5 32b operators with the V5 40b operators), it can be seen that the FPGA family upgrade has a major impact on resources and performance (compare the Virtex-4 (V4) 32b operators with the Virtex-5 (V5) 32b ones), due to technology improvements (from 90 nm (V4) to 65 nm (V5) base technology) and architectural ones (from 4-input LUTs, 18x18 multipliers and 18 KB BRAMs (V4) to 6-input LUTs, 25x18 multipliers and 36 KB BRAMs (V5)).

5.5.3. Cores Implementation

Table 5.3 summarizes the post place and route implementation results (with and without product valuation) for the RNG adapted to the architecture parameters, for the LMM core, and for the complete engine integrating both cores, their interface and their controls. Due to the huge number of floating-point operators involved, the LMM is extremely resource-demanding, reaching up to 74% (77%) of the slices. For the selected FPGA this implies that no Monte Carlo replications can be done in parallel, as another core cannot be introduced. With respect to the working frequency, as seen in Section 3.5.3, the performance is harmed as more resources are used, due to routing issues. In this way, the 59.0 MHz working frequency of the adder, which represents the slowest path in the system, is not achievable. In this case it falls to 50.1 MHz, which has been achieved only when timing constraints were used for the place and route tasks; otherwise the clock frequency falls even more. Finally, with respect to the product valuation, the large increase in pipeline stages it represents stands out, in contrast to the small increase in resources and number of operators. The use of two dividers and the exponentiation unit is the main reason for this.

5.6. Conclusions

Once all the basic elements needed for the design and implementation of an FPGA accelerator are available, many other issues and tasks must be addressed. In this chapter we have implemented a complete accelerator engine for the LMM, carrying out many of these tasks, notably:

• Algorithms analysis.

• Adaptation of the model and specific modification for FPGAs.

• Precision-accuracy analysis.

• Architecture design.

• Basic elements modifications.

• Integration within different cores.

• Control design.

Understanding all these tasks and analyzing how they impact the design and implementation of hardware acceleration is essential to achieve a reliable, high-performance accelerator. Two options have been considered and fully implemented: an accelerator comprising the Monte Carlo simulation of the LMM together with the random number generation, and the same accelerator also including the product valuation. To match the parameterization required by LMM Monte Carlo simulations, we have developed an adaptable architecture based on the maximum expected values of the parameters affecting the architecture. Finally, and due to the limitations introduced by data feedbacks, to fulfill the highest-performance goal of any accelerator we have oriented the architecture to implement as many operations in parallel as possible and to compute one variable of the system per clock cycle.

In this chapter we have demonstrated that complex applications can be implemented in an FPGA, although a careful design has to be carried out after a deep analysis of the implemented model and of its relation with the other cores that are needed, in this case a Gaussian RNG with Latin Hypercube. Finally, we have also seen that FPGAs can take advantage of tailored precisions: in a complex model with several data dependencies, and thereby error propagation, such as the LMM, a single-precision floating-point format extended with eight mantissa bits has been enough to fulfill the exacting criterion we have adopted.

While a full paper on the implementation of the LMM engine is under preparation, part of the work carried out in this chapter has been published in a DATE 2011 workshop [ELV11a]:

• Pedro Echeverría and Marisa López-Vallejo. FPGA acceleration of Monte Carlo-based financial simulation: Design challenges and lessons learnt. 2011.

6

Hardware-Software Integration

The development of a hardware accelerator is incomplete without its integration within the system and application where it is going to be embedded. Data computed in the hardware accelerator must be fed to the target application through the system where both are running. The integration comprises both software and hardware elements, and how it is done is a key part of the global performance of the accelerator. In this way, the real performance of any accelerator is not only given by how fast it can work in isolation, but also by how fast it can communicate the computed data. Likewise, the global performance of the application also depends on how fast the application can access the data computed by the accelerator. Another key issue in the application's global performance is how the application is divided into hardware and software, the system partitioning. One of the objectives of this chapter is to study the above-mentioned issues through the integration of the developed LMM engine (described in the previous chapter) within a software application. An analysis of how it can be done and which elements are required is presented. Then we detail the hardware-software integration carried out, which is based on a personal computer and the PCI-Express (PCI-E) bus. We will focus not only on the implementation itself but also on the integration features

(both software and hardware) that have an impact on the total acceleration. The second main objective is to complete the evaluation of the developed engine in a real scenario, to measure the real speedup achieved and to validate the complete accelerator. Additionally, in this chapter the hardware-software partitioning policy we have followed prior to the development of the accelerator is explained.

6.1. Hardware-Software Partitioning

To decide how hardware-software partitioning can be carried out, we have identified three key features that have to be analyzed:

• Tasks stability characteristics.

• Communication overheads.

• Achieve maximum possible acceleration.

6.1.1. Tasks Stability Characteristics

Developing a hardware accelerator requires a long design cycle. Thus, one key rule in hardware-software partitioning is to port to hardware those tasks that, in addition to presenting heavy computations, are very stable, in the sense that they will not require frequent code changes and modifications. Meanwhile, software is much more flexible, and any task subject to frequent changes should remain in software. In our particular case, this key rule is related to how the LMM is used within financial simulations. As introduced in Section 5.5.2, on the one side the LMM itself is a complete model whose output is a curve of interest rates. The LMM (including the random number generator) can be defined as an almost static task, as the model it implements is very stable and changes to it are infrequent. In this way, it presents ideal features for a reliable FPGA accelerator. On the other side, the LMM interest rate curves are mostly used to value different financial products by means of Monte Carlo simulations, the product valuation. In this case, there are plenty of products with very different features, as traders buy or sell products tailored to the necessities of their clients. This implies that, in addition to the complexity of implementing a set of many different products in hardware, the products are also frequently changed to introduce new features, or even new products are created. In this way, product valuation requires a flexibility which is more suitable for software.

6.1.2. Communication overheads

Splitting an application between two processing cores requires that the computed data be communicated between the cores. These data transfers imply a communication task which originally was not present in the application. Depending on how this communication task is carried out, the performance can be affected by the communication overhead. Furthermore, communication transfers between a host PC and a hardware accelerator imply the use of mechanisms that can affect the performance of both hardware and software. In this way, it is necessary to carefully select where we split the application, to minimize both the computational overhead of the transfers and the number of transfers required.

6.1.3. Achieve maximum possible acceleration

Amdahl's Law [Amd67] limits the total speedup achievable when parallel computations are carried out (as the use of an accelerator can be understood), according to the longest task which is not parallelized:

Speedup = \frac{1}{r_s + \dfrac{r_p}{n}} \qquad (6.1)

where rs is the ratio of the application which remains sequential, rp the ratio of the application that is parallelized (rs + rp = 1), and n the speedup obtained for the parallelized part. In this way, the speedup obtained by the use of any accelerator will be equal to or smaller than the maximum achievable value, which is determined by the part of the application remaining in software:

Speedup = \frac{1}{r_s} \qquad (6.2)

for n → ∞.

Following Amdahl's Law, it is desirable to make rs as small as possible. In our case, this would imply implementing the product valuation in the FPGA.
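A quick numerical illustration of equations (6.1) and (6.2) (the figures are my own example, not measurements):

```python
def amdahl_speedup(r_s, n):
    """Equation (6.1): overall speedup when a fraction r_p = 1 - r_s is accelerated n times."""
    return 1.0 / (r_s + (1.0 - r_s) / n)

# If the product valuation kept in software accounted for 10% of the runtime (r_s = 0.1),
# even an infinitely fast LMM engine could not exceed a 10x overall speedup.
print(amdahl_speedup(0.1, 50))   # ~8.5x with a 50x faster accelerated part
print(1.0 / 0.1)                 # 10x upper bound, equation (6.2)
```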

6.1.4. Partitioning Policy

In the case of the LMM we have a clear conflict between the task stability characteristics and the aim of obtaining the maximum possible acceleration from the partitioning, due to the non-stable nature of the product valuation. In our case, we have prioritized task stability, so the base partitioning leaves the product valuation in software. In the following sections, the hardware-software integration will be focused on this partitioning. However, the implementation results also comprise the case where the product valuation of the product implemented as an example in Chapter 5 is included in hardware.

Figure 6.1: Integration Architecture.

6.2. System Architecture and Communications

The adopted integration solution is based on the use of an FPGA on a board with a PCI-E connector as communication interface. The different elements and types of data transfers involved in the hardware-software integration phase are shown in Figure 6.1. In the middle of the figure, between the two processing elements, the CPU and the FPGA, the three different forms of data transfer we have used are shown:

• PIO: Programmed Input/Output.

• DMA: Direct Memory Access.

• IRQ: Interrupts.

The RAM memory of the system is shared between the CPU and the FPGA through the DMA communication subsystem, which is in charge of most of the data transfers. In the bottom part of Figure 6.1 we can find the prototyping board, a HiTech Global HTG-V5-PCIE-200 [Glo], with the selected FPGA (Xilinx Virtex-5 FX200T) and a PCI-E endpoint with 8 lanes. The FPGA has to implement not only the hardware accelerator, but also a communication core to

manage the PCI-Express endpoint, and an interface element between that core and the developed accelerator engine.

Figure 6.2: Communications Flow.

Meanwhile, in the top part of Figure 6.1 we can see the software application executed in the CPU of the host computer. In any application relying on hardware acceleration, the accelerated code has to be replaced by calls to the accelerator and by structures to send data to and receive data from the accelerator. These calls or routines form a low-level library that is in charge of communicating the application with the computer program that handles the bus, the driver. This communication flow between the software application and the hardware engine is depicted in Figure 6.2.

6.2.1. Why PCI-Express?

The choice of PCI-E [PS] as the communication bus is due to several reasons:

• High speed and flexible high bandwidth. PCI-E is a full-duplex, high-speed serial connection that is adaptable to different requirements. The key to its flexibility and high speed are the lanes, as a PCI-E interconnection can be formed of 1, 2, 4, 8, 16 or 32 lanes, where each lane can transfer 2.5 Gbits per second in its slowest version, corresponding to a data transfer rate of 250 MB/s (there is an 8b/10b encoding for the physical transfer, meaning a 20% overhead).

• Bus Master DMA. The use of Direct Memory Access communications with the hardware as Bus Master opens many possibilities. On the one hand, it allows transparent transfer of data using RAM memory: data is stored directly in the RAM memory by the producer, where it is accessed by the consumer when it needs it. On the other hand, as the hardware can be Bus Master, it can initiate DMA transfers on its own. Both features reduce the computational load of the software, as the only direct communication needed for DMA transfers is the one that signals that the data are available and where they are.

• Evolution. Although the data transfer bit rate is not a hard constraint for our accelerator, other applications can demand extremely high bit rates. The first version of PCI-E is capable of a

data transfer rate of 250 MB/s per lane, which is doubled to 500 MB/s in version 2 and doubled again in version 3, the latest version. Hence, it offers a massive bit rate, ensuring that if other applications are implemented or our implementation is improved (or faster subcomponents are implemented as accelerators themselves), we can continue using PCI-E.

• "Almost everywhere, almost standard". PCI-E is becoming a de facto standard nowadays and can be found in almost every system, especially in the motherboards of PCs and servers. Its combination of high bandwidth and flexibility has made PCI-E a universal bus for hardware expansion slots, displacing other high-speed, high-bandwidth buses such as PCI-X or AGP.
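As a back-of-the-envelope check of the available bandwidth (my own arithmetic from the figures above, not a measurement):

```python
# PCI-E 1.x payload bandwidth for the 8-lane endpoint used here: 2.5 Gbit/s per lane
# with 8b/10b encoding (20% overhead) leaves 2 Gbit/s = 250 MB/s of payload per lane.
lanes = 8
line_rate_bps = 2.5e9
payload_per_lane_MBps = line_rate_bps * (8 / 10) / 8 / 1e6   # = 250 MB/s
print(lanes * payload_per_lane_MBps)   # -> 2000 MB/s theoretical payload for x8
```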

6.2.2. Communications Model

The communications infrastructure between the application and the accelerator must be designed to minimize its impact on the global performance. From the software perspective, communications must imply as low a computational overhead as possible, while for the hardware, communications must allow the accelerator to work at its maximum throughput; otherwise the accelerator has to be stopped, see Section 6.3.1. Following these ideas, several facts have to be analyzed to define the communication model:

• The data flow of the application and the impact of the data transfers on it.
• The different types of data or events to transfer and the quantity of data of each type.

With respect to the first point, Figure 6.3(a) shows a simplified dataflow representation of the software application we are accelerating. As explained in the previous chapter, before starting the Monte Carlo simulation a calibration of the model is needed to compute the variables which remain constant through all the Monte Carlo replications. Afterwards, the Monte Carlo simulation starts, composed of two main tasks, the LIBOR calculation and the product valuation using the generated LIBOR interest rate curves. These two tasks are carried out subpath by subpath once the random numbers for the group of subpaths have been generated. When the accelerator is introduced, the dataflow changes as shown in Figure 6.3(b). First, the simulation parameters and the data from the calibration have to be transferred to the FPGA, as they are required for the LIBOR simulation. Second, during the Monte Carlo simulation itself, the LIBORs calculated in the FPGA have to be transferred to the software for the product valuation. However, now we have two computing elements that should work in parallel to obtain the maximum performance from the system. In this way, the FPGA should work ahead of the software: while one group of LIBORs is being processed in software, the FPGA must be computing the next one. In the software dataflow this group of LIBORs is a complete subpath; however, as explained in Section 5.3, the hardware does not follow the same simulation order as software, since it simulates all related subpaths together.

(a) Top-view Software Dataflow.

(b) Top-view Software-Hardware Dataflow.

Figure 6.3: Dataflows.

Table 6.1: Types of Data Transfers.

                Massive                Individual              Events
SW to HW        Calibration Data       DMA control data        Control Signals
                Constant Data          Simulation Parameters
HW to SW        Calculated LIBORs                              Control Signals

Hence, the FPGA now transfers a grouped path with all the LH subpaths. Additionally, the communications model has to take into account that the LIBORs' transfer is not done at once but as P/LH transfers. With respect to the different types of data to transfer, several classifications can be made. First, we can distinguish based on the transfer direction, from the application to the accelerator and from the accelerator to the application. Second, based on the amount of data to transfer, we can distinguish between massive data transfers, suitable for DMA, and small or individual transfers such as the simulation parameters, suitable for PIO communications. Finally, based on the kind of information transferred, a further distinction has to be made for the data signaling events. All these data transfers are summarized in Table 6.1. In addition to the data transfers of the model itself, in the table we can also find two more types of data: the DMA control data and the control signals. For the transfers which involve massive data, DMA communication is the fastest option as it requires minimum software overhead: the data to transfer is copied to RAM, from where it is accessed by the element that requires that data. In the case of the CPU, that data can be used directly from the RAM, while in the case of the FPGA a copy of it into the FPGA internal memory elements is needed. To avoid overloading the software with most of the communications handling, the PCI-E of the hardware can be configured to be Bus Master with the necessary control data (RAM initial address and data size). Finally, control signals are needed for the synchronization between the software and the hardware. The main task of this synchronization is signaling when a DMA transfer is completed. In these cases, the hardware (the Bus Master) has to notify the software through an interrupt. Signaling these events with interrupts allows the software to continue working instead of being stalled waiting for the event.
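A minimal sketch of what the DMA control data of Table 6.1 could look like on the software side is shown below; the field names and widths are hypothetical, but the content matches the control data mentioned above (RAM initial address and data size):

#include <stdint.h>

/* Hypothetical layout of the DMA control data written to the hardware
 * through PIO so that it can act as Bus Master. */
struct dma_control {
    uint64_t ram_phys_addr;   /* physical start address of the RAM area */
    uint32_t length_dw;       /* transfer length in 32-bit Data Words   */
    uint32_t area_id;         /* which of the RAM areas is the target   */
};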

6.2.3. Communications Requirements

To make the parallel scheme possible, where the FPGA is calculating one grouped path while the CPU is processing the previous one, and to get the maximum performance, the CPU should not be stalled. In this way, when the CPU finishes processing one grouped path, the LIBORs of the next path should be ready in the RAM memory. Thereby, the FPGA must transfer to RAM the LIBORs of the path it is calculating while the CPU is accessing the LIBORs of the previous path. If the FPGA always transfers the calculated LIBORs to the same area of the RAM memory, this working scheme will lead to overwriting LIBORs not yet processed by the CPU. To solve this situation a dual-area memory scheme is needed, defining two different areas of the RAM memory for the LIBORs transfers while the FPGA and the CPU swap between them. Following this scheme, the FPGA will transfer one complete data set to one of the two RAM areas while the CPU is working with the previous path, which was transferred to the other area. When the CPU finishes with the area it is processing, it will swap to the other area where the FPGA has transferred the next path, freeing the area it was using and letting the FPGA transfer a new path to the freed area. This scheme requires synchronization signals to communicate when the FPGA has completed a path transfer (interrupt) and a notification from the software when the CPU frees one of the RAM areas for the FPGA (PIO access). To ensure the maximum performance, another requirement must be fulfilled: the communications bus and the selected scheme must be capable of transferring the calculated data as fast as they are generated. If the transfer bit rate is slower than the throughput of the accelerator, the accelerator performance will be harmed and it has to be stopped, see Section 6.3.1.
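The following C sketch illustrates the CPU side of the dual-area scheme just described. The helper functions are hypothetical stand-ins for the low level functions presented in Section 6.4.1 (waiting on the completion interrupt and freeing an area through PIO):

#include <stddef.h>

/* Hypothetical helpers: wait_path_ready() blocks until the FPGA interrupt
 * signals that a grouped path is complete in the given area, and
 * release_area() performs the PIO write that frees an area for the FPGA. */
extern void wait_path_ready(int area);                 /* interrupt-driven */
extern void release_area(int area);                    /* PIO access       */
extern void value_products(const double *libors, size_t n);

void consume_grouped_paths(const double *area_buf[2], size_t n_libors,
                           int n_grouped_paths)
{
    int area = 0;
    for (int p = 0; p < n_grouped_paths; p++) {
        wait_path_ready(area);                 /* FPGA finished this area    */
        value_products(area_buf[area], n_libors);
        release_area(area);                    /* FPGA may overwrite it now  */
        area ^= 1;                             /* swap to the other RAM area */
    }
}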

6.3. PCI Express Core

The developed accelerator must be integrated within the FPGA with a PCI-E core capable of managing the data transfers between the accelerator core and the software. For the selected FPGA and board, the Xilinx Endpoint Block Plus for PCI Express [Xilc] is available. This core implements the logic needed to receive and transmit the Transaction Layer Packets (TLPs) that encapsulate the data to transfer together with the protocol information. Additionally, it provides an interface for the user logic (transaction interface) on one side, and the connection with the FPGA RocketIO transceivers connected to the PCI-Express pins on the other side. Additional hardware is needed to control the communications (Bus Master DMA, PIO and interrupts) and to process the TLPs, extracting the data from them when they are received and forming the TLPs from the data to be transmitted. Again, Xilinx provides a reference design [Xila] for this purpose. However, as a demonstrator it is far from being a complete core, although it is a good base to develop one. In this way, we have modified it to be used in a real application:

• Interface for a real application. Management of input and output data.

• Completion of the Bus Master control.

• Adaptation to the requirements of the target application.

• Interrupt manager adapted to the interrupts needed.

Figure 6.4: PCI-Express Core.

Figure 6.4 depicts a simplified scheme of the complete PCI-E core we have integrated. On the top part, the Xilinx Endpoint in charge of managing the TLPs is found. The interface of the endpoint with the rest of the core is composed of three main units: the Receiver (RX), the Transmitter (TX) and the Interruption Manager. The TLPs addressed to the hardware are processed by the Receiver to extract the data and determine the type of input communication. The Transmitter composes the TLPs with the data to transmit to the software and generates control TLPs with the Bus Master DMA instructions. The third main unit, the Interruption Manager, starts the interrupt requests and handles all the interrupt signals. The DMA and PIO control is in charge of all the information related to data transfers other than the data itself (addresses, state of the DMA, etc.) and coordinates all the control signals between the endpoint and the three interface units and within these units. As the hardware is Bus Master, this control also handles how the TX has to compose DMA TLPs (data requests for DMA reads and write requests with their associated data for writes). The last main component is the BAR (Base Address Register), a set of registers which can be accessed by the software through PIO instructions.

6.3.1. Within FPGA Communications

As seen in Figure 6.4, the PCI-E core sends and receives data from the accelerator core. However, some kind of interface adapter between both cores is necessary for two main reasons:

• Different data handling.

• Cores with different working frequencies.

First, the PCI-E (and therefore the PCI-E core) works with just one class of data, the Data Word (DW), which corresponds to a 32-bit word. Although the communications through the bus are based on bytes, the minimum data unit to transmit is a DW and bigger transmissions should be multiples of a DW (a scheme based on masks is used for data smaller than a DW). A PIO transfer corresponds to just one DW, while in a DMA transfer several DWs are handled over several clock cycles. In each clock cycle we can find one data DW (the first and/or last data DW) or a pair of data DWs (all the other data DWs). However, data to and from the accelerator are not restricted to 32-bit words. Furthermore, the way data are generated may not correspond to the PCI-E core data handling, requiring a data adapter interface. Second, the clock frequency of both cores does not have to be the same. The PCI-E core frequency is fixed to two possible frequencies, 100 or 250 MHz (in our case, the latter is used), related to the bus frequency, while the accelerator should have the highest possible frequency. Thereby, the interface adapter not only has to solve the different data handling but also the synchronization problems between two cores working at different clock frequencies. These circumstances can be found not only in our case study application, but in any other application. Hence, we have chosen to develop a general interface that can be used for our accelerator or any other.

6.3.1.1. PCI-Express & Accelerator Interface

Developing a general interface requires splitting the data handling into two adapters: while the data adaptation to the PCI-E core is handled by the interface adapter, the specific data handling for the accelerator is not carried out in the adapter. In this way, the interface is prepared for a single DW for PIO transfers and for one or two DWs for DMA transfers. The general solution adopted is shown in Figure 6.5. Asynchronous FIFOs and registers are required (in the center of the figure) to solve the synchronization between the PCI-E clock and the accelerator clock. For the DMA communications, each asynchronous FIFO is composed of two FIFOs with a bit-width of one DW each. Associated to them, two adaptation FIFOs are required to ensure correct transmissions and receptions, as DMA transmissions or receptions can finish before a TLP is completed due to a problem in the bus or in the Xilinx core. The adaptation FIFOs handle these situations, freeing the asynchronous FIFOs where they would be much more difficult to handle due to the two clock domains. In addition to the data itself, it is required to send the DMA information (data type and format, number of words and destination), shown on the left side of the figure.

Figure 6.5: PCI-Express & Accelerator Interface.

This information corresponds to a few bytes and can be handled by an asynchronous register. In the figure we have only represented the mechanism for DMA transfers to the accelerator, although it may also be needed for DMA transfers to the PCI-E. In our specific case it is not needed, as only one type of data is sent to the application, the calculated LIBORs. The specific data handler of the application will use the DMA information to determine which data is being received, its format and size, and will carry out the necessary tasks to store it in the appropriate storage element and with the appropriate format. In the case of the calculated LIBORs, it will transform the hardware format (float plus 8 extra mantissa bits) into the software one, double-precision floating-point format. The working scheme is the same for the PIO transfers, although instead of asynchronous FIFOs we have opted for just an asynchronous register for each direction, as they are isolated transfers. Nevertheless, a small synchronous FIFO is attached to the registers for security, so as not to lose any PIO communication.
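As an illustration of the format conversion mentioned above, the sketch below converts the hardware number format (single-precision float with 8 extra mantissa bits) into a double. The 40-bit packing assumed here (sign, 8-bit exponent with the IEEE single bias, 31-bit mantissa) is our assumption, not necessarily the exact hardware layout, and the infinity/NaN exceptions are omitted:

#include <stdint.h>
#include <string.h>

/* Hedged sketch of the float-plus-8-extra-mantissa-bits to double conversion
 * performed by the software. Denormalized numbers are assumed to have been
 * flushed to zero by the hardware. */
static double hw40_to_double(uint64_t w)
{
    uint64_t sign  = (w >> 39) & 0x1;
    uint64_t exp8  = (w >> 31) & 0xFF;     /* bias 127, as in IEEE single   */
    uint64_t man31 = w & 0x7FFFFFFFULL;    /* 23 + 8 extended mantissa bits */

    uint64_t bits;
    if (exp8 == 0) {
        bits = sign << 63;                             /* +/- 0.0           */
    } else {
        uint64_t exp11 = exp8 - 127 + 1023;            /* re-bias to double */
        bits = (sign << 63) | (exp11 << 52) | (man31 << (52 - 31));
    }

    double d;
    memcpy(&d, &bits, sizeof d);                       /* safe type-pun     */
    return d;
}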

6.3.1.2. Communications Stalls and Performance

When the asynchronous FIFOs are full, the source has to wait until the receiver processes the data. These situations have an impact on the total performance of the system. Stalls of the PCI-E Receiver have a limited impact, as DMA communications to the FPGA are required only once, when preparing the hardware accelerator with the calibration and the constant data. However, stalls of the hardware accelerator can have a bigger impact on performance and harm the throughput of the accelerator. Two circumstances can provoke these stalls:

• The way the PCI-E bus works.

• The software application itself.

The PCI-E bus requires synchronization and other control TLPs, which means that data TLPs cannot always be sent. Moreover, it is a shared bus working with a credit-based system to decide which hardware (in case more hardware systems want to use the bus at the same time) has the right to use the bus, with the same effect of not allowing our hardware to send data TLPs. These two situations can be effectively handled by providing asynchronous FIFOs with enough depth, as the clock rate of the PCI-E (250 MHz) is much higher than the accelerator clock rate. When data TLPs are not allowed, the FIFOs will absorb all data from the accelerator, and when data TLPs are allowed again the higher frequency of the PCI-E will free the FIFOs. On the other hand, the application itself can force the accelerator to stall when the software is not able to process the LIBORs at the same rate as they are produced, so the accelerator has to wait for the software. In this case, the accelerator has to wait until the software frees the memory zone it is using once the accelerator has finished writing the next path to the other memory zone. This situation cannot be avoided and it means we are wasting part of the performance of the accelerator.

6.4. Software

To make possible the use of a hardware accelerator within a software application, the part of the program which is accelerated has to be replaced by calls or subroutines to the hardware accelerator. These calls or subroutines form a group of low level functions that allows the programmer to handle the accelerator through general commands in an easy way. The complexity of handling the accelerator is left to the driver. Finally, the software application also has to be modified in case the hardware accelerator does not work with exactly the same dataflow as the software, as is our case.

6.4.1. Driver and Low Level Functions

The low level functions form an interface to handle the driver from the application, and must be as independent as possible of the operating system (OS) where the application is executed. We have developed these functions to make them as general as possible, relying on a set of arguments to make the details of the hardware implementation as transparent as possible (a list of defines encapsulates the hardware details). The main routines carried out are the following (a sketch of such an interface is given after the list):

• Initialization of the driver.

• Initialization of the two memory areas.

• Routine for asking if there is data available for the SW.

• Freeing of memory areas for the HW.

• Sending of DMA control data to the HW so the HW can act as Bus Master.

• PIO data write and read.

• Start and reset instructions to the HW.
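A possible C prototype of this interface is sketched below. The names, return conventions and arguments are illustrative only; in the real implementation the hardware details are hidden behind a list of defines:

#include <stdint.h>
#include <stddef.h>

/* Hypothetical low level function interface wrapping the driver. */
int accel_open(const char *dev_path);                /* driver initialization    */
int accel_init_areas(size_t area_bytes);             /* the two RAM memory areas */
int accel_data_ready(int *area);                     /* is there data for SW?    */
int accel_free_area(int area);                       /* give the area back to HW */
int accel_send_dma_ctrl(uint64_t addr, uint32_t dw); /* HW can act as Bus Master */
int accel_pio_write(uint32_t reg, uint32_t value);   /* PIO write to a BAR reg   */
int accel_pio_read(uint32_t reg, uint32_t *value);   /* PIO read from a BAR reg  */
int accel_start(void);                               /* start instruction to HW  */
int accel_reset(void);                               /* reset instruction to HW  */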

On the other hand, the driver is the program used by the OS to handle some specific hardware, in our case our accelerator. In this way it is completely tied to the specific OS and the hardware used. We have selected a Linux OS, Ubuntu 10 [Ubu] with a 2.6.32.8 kernel, for the prototype. In Linux, only the part of the driver required to control the hardware and transfer information to it using general system instructions needs to be developed. All the other tasks, such as the control of the bus, are carried out by other drivers which are part of the kernel, allowing us to develop a driver focused only on the functionality needed. In this way the driver handles:

• The connection to the low level driver which handles the bus and to the hardware previously found by the low level driver.

• Memory allocation for the driver.

• Translation of the low level functions into the operating system hardware instructions.

• Interrupt handling.

Of these tasks, the most relevant one for the performance of the system is the interrupt handling. The handling of the hardware interrupts implies a software computational overhead that harms the performance. Therefore, the number of interrupts generated by the hardware should be minimized.
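A hedged sketch of the interrupt path of such a Linux driver (kernel 2.6.32 style) is shown below. The accel_dev structure and its fields are hypothetical driver state, and the real driver may differ, but the principle is the same: the handler only records the "grouped path ready" event and wakes the waiting application, keeping the per-interrupt overhead minimal.

#include <linux/interrupt.h>
#include <linux/pci.h>
#include <linux/wait.h>

/* Hypothetical per-device state used by the "is data ready" low level function. */
struct accel_dev {
    wait_queue_head_t wait_queue;
    int path_ready;
    int msi_enabled;
};

static irqreturn_t accel_irq_handler(int irq, void *dev_id)
{
    struct accel_dev *dev = dev_id;

    /* The FPGA raises one interrupt per completed grouped-path DMA write. */
    dev->path_ready = 1;
    wake_up_interruptible(&dev->wait_queue);
    return IRQ_HANDLED;
}

/* Registration during probe, using MSI if available to lower the overhead. */
static int accel_setup_irq(struct pci_dev *pdev, struct accel_dev *dev)
{
    if (pci_enable_msi(pdev) == 0)
        dev->msi_enabled = 1;
    return request_irq(pdev->irq, accel_irq_handler, 0, "accel", dev);
}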

6.4.2. Application Modification

As previously introduced, some changes in the application are required to replace the part of the software which is being accelerated with the subroutines that obtain the results from the hardware. In our case, some extra tasks are needed. The first is the preparation of the hardware with the calibration and constant data and with the parameters needed to carry out the simulation. In this task all the data transfers to the hardware are performed and, finally, the start instruction is sent to the hardware. Once these tasks are done, the replaced software is substituted by the routine that asks whether hardware data is available or waits until it is. Then the data is accessible by the software, which can process it as if it were software generated.

Figure 6.6: Detailed-view Software-Hardware modified dataflow.

However, as advanced in Section 6.2.2, the hardware calculates a grouped path instead of one subpath at a time, computing the LIBORs of all the subpaths for each time step of the simulation before advancing to the next time step, see Section 5.3.1. Hence, it is necessary to transform the order in which data have been transferred from the hardware into the order expected by the software LIBORs processing. In this way, a routine has been introduced to reorder in software the data returned by the FPGA so as to completely adapt the software application. As the software processes just one subpath at a time, this routine extracts all the data of the corresponding subpath before starting to process it. The complete software-hardware dataflow with these modifications is shown in Figure 6.6. The three new software tasks imply a computational overhead harming the total acceleration, although their impact is rather different. The preparation of the hardware is done just once, implying a small computational overhead. Meanwhile, the impact of the routine that checks whether the data is ready is negligible, as it is just a wait state if data is not ready. However, although the reorder routine mainly implies fast instructions like data reads and writes and index handling, it can have a major impact as it is repeated for each subpath while the data involved in the reorder can represent several megabytes.
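A minimal sketch of such a reorder routine is given below. It assumes the FPGA returns the grouped path ordered as [time step][subpath][LIBOR] with a fixed number of LIBORs per time step; in the real model the number of alive LIBORs shrinks over time, so the index arithmetic is more involved:

#include <stddef.h>

/* Extract one subpath from a grouped path returned by the FPGA,
 * reordering it into the [time step][LIBOR] layout expected by the
 * software product valuation. */
void reorder_subpath(const double *grouped,   /* FPGA ordering                  */
                     double *subpath_buf,     /* [steps][libors] of one subpath */
                     int subpath, int lh,     /* subpath index, LH subpaths     */
                     int steps, int libors)
{
    for (int s = 0; s < steps; s++) {
        const double *src = grouped + ((size_t)s * lh + subpath) * libors;
        double *dst = subpath_buf + (size_t)s * libors;
        for (int l = 0; l < libors; l++)
            dst[l] = src[l];
    }
}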

6.5. Experimental Results

To test the speedup that can be achieved with the integration of the accelerator within a software application, we have run a set of two Monte Carlo simulations which differ mainly in the product valuation. Additionally, for both case studies, three different ratios between time steps and fixing dates are analyzed. The experimental environment is composed of the prototyping board (HiTech Global HTG-V5-PCIE-200 with a Xilinx Virtex 5 FX200T) and the host system CPU, an Intel i7 920 [Cor] (4 cores, 2.67 GHz and 8 MB cache) with 4 GB of RAM. The base software used is a financial software package that we have adapted to integrate the FPGA.

Table 6.2: Implementation results of the complete accelerator for a V5-FX200.

                            Slices        DSPs       BRAM       Working Freq. (MHz)
PCI-E Core  Endpoint        768 (2%)      -          3 (0.6%)   250
            BMD             1259 (4%)     -          1 (0.2%)
Interface                   186 (0.6%)    -          6 (1%)     250-50
Complete    Accelerator     25493 (82%)   144 (37%)  144 (31%)  50
            + Product       27204 (88%)   171 (44%)  153 (33%)

This section is split in three parts. First, the implementation results for the complete core in the target FPGA are summarized. Second, a profiling of the software execution of both simulations is carried out, with a theoretical approximation to the achievable speedup based on those profiles. The profiling has been carried out using timers within the software with microsecond precision (a minimal sketch is given below). Finally, the speedup results obtained for the simulations when the hardware accelerator is used are presented and analyzed.
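The microsecond-precision timers mentioned above can be implemented with standard POSIX calls; a minimal sketch of the kind of timer used for the profiling is:

#include <stdio.h>
#include <sys/time.h>

/* Elapsed time in microseconds between two gettimeofday() samples. */
static double elapsed_us(struct timeval start, struct timeval end)
{
    return (end.tv_sec - start.tv_sec) * 1e6 + (end.tv_usec - start.tv_usec);
}

/* Usage (run_task() is a placeholder for the task being profiled):
 *   struct timeval t0, t1;
 *   gettimeofday(&t0, NULL);
 *   run_task();
 *   gettimeofday(&t1, NULL);
 *   printf("%.0f us\n", elapsed_us(t0, t1));
 */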

6.5.1. Complete Accelerator Implementation Results

Table 6.2 summarizes the implementation results for the complete accelerator and the new modules required for communications, the PCI-E core and the adapter interface. The complete implementation relies on two clock signals. The first clock (250 MHz) is obtained from the PCI-E bus and the prototyping board. Its frequency ensures that communication transfers will not become a bottleneck harming the performance. The second clock, a 50 MHz clock, is derived from the 250 MHz one and is selected following the results obtained in Chapter 5. The frequencies of these clocks have to be used as timing constraints for the post-place & route to ensure a correct behavior of the FPGA logic. Additionally, it has been necessary to introduce area constraints, since otherwise the timing constraints were not achievable. As can be seen in Table 6.2, the PCI-E core (divided into the Xilinx PCI endpoint and the Bus Master DMA logic (BMD)) and the interface adapter require a reduced number of resources, having very little impact on the global figures.

6.5.2. Software Profiling

We have selected two real case studies with different product valuations which differ in their degree of complexity and hence in their associated computational load. On the one hand we have the cap used as product valuation example in the previous chapter. On the other hand, we have a triple range accrual [Ran] (TRA from now on), which involves more complex computations than the cap as it is based on three factors.

Table 6.3: Test Simulation Features.

        Nodes    S      LH    F     N      Q
Cap     4        109    20    20    224    108
        12       365
        24       730
TRA     4        103    20    20    133    93
        12       289
        24       569

We have simulated three scenarios for both products, based on the ratio of time steps versus fixing and maturity dates. The LIBORs for both cases are quarterly LIBORs (four fixing dates per year) and the three scenarios consider the following time steps per year (from now on nodes):

• Quarterly: 4 time steps per year.

• Monthly: 12 time steps per year.

• Fortnightly: 24 time steps per year.

Thereby, we have an increase from 1 time step per fixing date and payoff date (these dates are related to fixing dates) to 6 time steps per fixing date. Consequently, as time steps are increased, the execution time devoted to the LMM will increase while the time devoted to the product valuation remains almost stable. The main features of the two case studies are summarized in Table 6.3 for the three selected time steps. In addition to the different product valuation (and its associated complexity), the TRA presents a smaller number of LIBORs (the simulated interest rate curve spans fewer years). These three circumstances, complexity, number of LIBORs to simulate and the different number of nodes, lead to extremely different software profiles when the time devoted to LMM and GRNG on one side and the product valuation on the other side are considered. One last consideration is the number of LIBORs remaining alive in both simulations in the last time steps. In both cases there are more LIBORs than the value of F (which determines the number of Gaussian variables per time step) and hence the number of alive LIBORs will be the dominant parameter in those last time steps. Table 6.4 summarizes the software profiling carried out. First, it can be seen that the Pre & Post process, which includes the initial calibration and the final statistical measures, remains almost constant as the number of paths and the number of nodes per year increase. As mentioned before, the cap example presents a bigger number of LIBORs and therefore has more related data to handle in the calibration and initialization, requiring a longer time for these tasks. The other three measured tasks correspond to the Monte Carlo simulation itself and are proportional to the number of paths implemented. The times required for the random number generation and the LMM are related to the number of variables to be computed (summarized in Table 6.5).

Table 6.4: Software Profiling.

              Grouped   Total      Pre & Post    GRNG     LMM       Product   (LMM+RNG)/
      Nodes   Paths     Time (s)   Process (s)   (s)      (s)       (s)       Product
Cap   4       100       21.31      9.03          1.53     10.62     0.12      89.37
              200       33.39      8.84          3.09     21.21     0.24      88.66
              500       70.23      8.88          7.71     53.03     0.59      89.15
      12      100       45.96      8.93          4.61     32.27     0.15      220.25
              200       83.30      9.02          9.25     64.72     0.29      220.37
              500       194.20     9.15          23.09    161.21    0.74      216.92
      24      100       83.44      8.97          9.20     65.10     0.15      427.55
              200       157.85     9.05          18.41    130.07    0.31      421.76
              500       381.62     9.50          46.14    325.20    0.76      427.81
TRA   4       100       10.47      2.25          1.45     4.6       2.17      2.12
              200       18.64      2.20          2.91     9.22      4.32      2.13
              500       43.31      2.22          7.24     23.04     10.81     2.13
      12      100       21.95      2.24          4.06     13.46     2.19      6.15
              200       42.56      2.26          8.70     27.20     4.40      6.18
              500       100.80     2.36          20.29    67.33     10.82     6.22
      24      100       39.72      2.31          8.07     27.09     2.25      12.02
              200       77.16      2.43          16.11    54.11     4.50      12.03
              500       189.46     2.79          40.16    135.22    11.28     11.99

Table 6.5: Main variables to compute per subpath.

             Cap                        TRA
Nodes        4       12      24         4       12      24
LIBORs       18530   55583   111275     8554    24736   49143
Gaussians    2180    6520    13040      2060    5780    11380

Table 6.6: Extrapolated Profile (5000 grouped paths).

             Total      Pre & Post      GRNG             LMM               Product          LMM+RNG
     Nodes   Time (s)   Process (s)     (s)              (s)               (s)              FPGA (s)
Cap  4       622.18     8.88 (1.49%)    77.1 (12.39%)    530.3 (85.23%)    5.9 (0.95%)      37.06
     12      1859.55    9.15 (0.49%)    230.9 (12.42%)   1612.1 (86.69%)   7.4 (0.40%)      111.2
     24      3730.50    9.5 (0.25%)     461.4 (12.37%)   3552.1 (87.17%)   7.6 (0.20%)      222.5
TRA  4       413.12     2.22 (0.54%)    72.4 (17.53%)    230.4 (55.77%)    108.1 (26.17%)   17.1
     12      986.76     2.36 (0.24%)    202.9 (20.56%)   673.3 (68.23%)    108.2 (10.97%)   49.5
     24      1869.39    2.79 (0.15%)    401.6 (21.48%)   1352.2 (72.33%)   112.8 (6.04%)    98.2

Meanwhile, the product valuation is related to Q and the number of fixing dates. As both remain fixed for the three nodes, the time required for the valuation should remain stable when the nodes are increased. However, a small increase can be observed due to the memory requirements in software: the data structures to be handled grow with the number of nodes, harming the performance due to memory accesses.

6.5.2.1. Achievable Speedup

To estimate what theoretical speedup can be achieved when the accelerator is employed, on the one hand we can extrapolate the profiling carried out to consider a realistic number of paths (see Table 5.1). On the other hand, we can use the working frequency of the LMM engine of the FPGA (as it is capable of producing one calculated LIBOR per clock cycle). Targeting a typical value of 100000 paths (corresponding to 5000 grouped paths, as the simulations' LH parameter is 20) and considering the increase of time in the Pre & Post Process negligible, we obtain the profile shown in Table 6.6. When computed in software, the random number generation and the LMM are sequential tasks, but when computed in the accelerator they are parallel tasks. In this way, the FPGA execution time is determined by the LIBORs computation. Before estimating the achievable speedup, we have to focus on how the driver and the communication mechanism have been designed to take advantage of the intrinsic Monte Carlo parallelism. As two memory zones are used to return the FPGA-computed LIBORs to the CPU, software and hardware can work in parallel whenever there are calculated LIBORs to be processed by the software and a free memory zone to return LIBORs from the FPGA. The theoretical achievable speedup (in x times) is summarized in Table 6.7 for the migrated code, taking the 50 MHz FPGA clock as reference.

Table 6.7: LMM, GRNG and LMM+RNG Speedups.

               Cap                  TRA
Nodes          4      12     24     4      12     24
GRNG           17.7   17.7   17.7   17.6   17.6   17.6
LMM            14.3   14.5   14.6   13.5   13.6   13.8
LMM+GRNG       16.4   16.6   16.7   17.7   17.7   17.8

Table 6.8: Extrapolated achievable speedup.

                     Cap                  TRA
Nodes                4      12     24     4     12    24
Speedup (x times)    13.5   15.5   16.1   3.7   8.9   17.4

These results represent the maximum achievable speedups, as they are isolated from any other code remaining in software. Considering all these features, the total achievable speedup follows the simplified model below:

\[
\mathrm{Speedup} = \frac{T_{SW}}{T_{Pre\&Post} + \max\left(T_{FPGA},\,T_{PV}\right) + \min\left(T^{1path}_{FPGA},\,T^{1path}_{PV}\right)} \qquad (6.3)
\]

where $T_{Pre\&Post}$ is the pre & post process software time, $T_{FPGA}$ is the computation time for the FPGA, $T_{PV}$ is the software execution time for the product valuation, and $T^{1path}$ corresponds to the time devoted to one grouped path in the FPGA or in the software product valuation. The equation comprises two different situations:

• The CPU Monte Carlo task is faster than the FPGA Monte Carlo task. When the CPU processes one grouped path it has to wait until the FPGA finishes with the next one. Hence, the simulation time is dominated by the FPGA execution time, as all the CPU Monte Carlo execution time is overlapped with the FPGA execution time except for the last grouped path.

• The FPGA Monte Carlo task is faster than the CPU Monte Carlo task. In this case, the FPGA has to be stalled until the software processes one grouped path and frees the memory zone where its data was allocated, so the FPGA can return another grouped path. Now all the FPGA execution time is overlapped with the CPU execution time except for the first grouped path.

In our case study simulations and their software profiling, all the cap simulations correspond to the case where the CPU has to wait for the FPGA, while all the TRA simulations correspond to the case where the FPGA is stalled waiting for the CPU to process the paths. Table 6.8 summarizes the theoretical maximum achievable speedups following Equation (6.3). In both cases, when the number of nodes increases, the accelerated tasks represent a bigger part of the total execution time (see the (LMM+RNG)/Product column in Table 6.6), and hence the achievable speedup also increases. Meanwhile, for the TRA, the effects of Amdahl's law are easily observed. The product valuation remaining in software requires a considerable part of the total execution time. In this way, although the LMM plus the RNG are highly accelerated, the total speedup is poor until the percentage of the time devoted to the product valuation decreases with the increase of the number of nodes, increasing the ratio between simulation times and payoff times. Comparing the theoretical speedups for the total simulation (Table 6.8) with the ones corresponding to the partial tasks (Table 6.7), we can see that, when the payoff execution time overlaps completely or almost completely with the FPGA execution time, the achievable speedup tends to that of the migrated tasks, thanks to the technique of using two memory zones to let software and hardware work in parallel.
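A direct coding of Equation (6.3), useful to reproduce the figures in Table 6.8 from the profile in Table 6.6, could look as follows (all arguments in seconds; the per-path times are simply the totals divided by the 5000 grouped paths):

#include <math.h>

/* Simplified speedup model of Equation (6.3). t_fpga and t_pv are the total
 * FPGA and product-valuation times for the whole simulation; t_fpga_1p and
 * t_pv_1p are the same times for a single grouped path. */
static double model_speedup(double t_sw, double t_prepost,
                            double t_fpga, double t_pv,
                            double t_fpga_1p, double t_pv_1p)
{
    double overlapped  = fmax(t_fpga, t_pv);        /* slower side dominates   */
    double non_overlap = fmin(t_fpga_1p, t_pv_1p);  /* first or last path only */
    return t_sw / (t_prepost + overlapped + non_overlap);
}

/* Example (Cap, 24 nodes, data from Table 6.6):
 *   model_speedup(3730.50, 9.5, 222.5, 7.6, 222.5/5000, 7.6/5000)  ->  ~16.1x */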

6.5.3. Hardware-Software Solution Results

However, the theoretical approximation does not take into account several tasks related to the integration of the hardware with the software. In previous sections we have introduced the task related to the reordering of the data sent by the FPGA. Another task is the preparation and sending of all the initial data to the FPGA. To measure the speedup correctly it is necessary to run the application with the FPGA accelerator. Table 6.9 summarizes the profiling carried out (now some measures are shown in milliseconds). FPGA execution times cannot be directly measured as they overlap with some software tasks. In the table we can find the two new tasks: FPGA Pre, which corresponds to the initial configuration of the FPGA with the market and initial data, and Reorder, which corresponds to the software task needed to reorder the FPGA data. Finally, the Wait task is the task implemented to know in software when the FPGA data is ready. In this way, it represents all the FPGA execution time not overlapped with any other software task. As can be seen, the new tasks represent a very small overhead (less than 0.2 s in the worst case). With these results we can extrapolate the data to 100000 simulations (5000 grouped paths). Table 6.10 summarizes the speedups obtained. Comparing the obtained speedups with the ones we computed theoretically, we can see that they are exactly the same. The two new tasks required for the hardware-software integration either overlap with the FPGA execution time (cap) or represent such a small overhead (TRA) that their effect is negligible. Other related issues, such as the software handling of interrupts, are also overlapped with other execution times.

6.5.3.1. All in Hardware: Cap Valuation in Hardware

To test the LMM plus product valuation we have run a simulation with the cap product valuation in the LMM engine. In this case, instead of returning the whole interest rate curve, only one value is returned per subpath: the NPV computed for the cap product for the corresponding subpath curve.

Table 6.9: Hardware-Software Profiling.

              Grouped   Total      Pre & Post    FPGA Pre   Wait       Reorder   Product
      Nodes   Paths     Time (s)   Process (s)   (ms)       (s)        (ms)      (s)
Cap   4       100       9.69       8.95          0.31       0.61       21.17     0.11
              200       10.34      8.86          0.31       1.21       42.18     0.22
              500       12.6       8.89          0.32       3.03       103.74    0.57
      12      100       11.21      8.99          0.93       2.00       25.89     0.19
              200       13.46      9.01          0.95       4.00       52.41     0.39
              500       20.27      9.45          0.95       10.0       129.91    0.98
      24      100       13.25      8.80          1.79       4.24       35.11     0.17
              200       19.94      9.03          1.83       8.44       71.22     0.38
              500       31.76      9.51          1.8        21.12      175.6     0.95
TRA   4       100       4.44       2.24          0.35       3.43 ms    31.43     2.15
              200       6.63       2.23          0.26       3.44 ms    64.31     4.33
              500       13.20      2.23          0.23       3.44 ms    158.23    10.80
      12      100       4.48       2.25          0.66       9.90 ms    34.57     2.19
              200       6.73       2.26          0.66       9.92 ms    69.34     4.38
              500       13.48      2.37          0.66       9.91 ms    172.62    10.92
      24      100       4.54       2.31          1.31       19.69 ms   40.37     2.16
              200       6.88       2.44          1.28       19.68 ms   80.68     4.33
              500       13.87      2.78          1.29       19.69 ms   198.76    10.87

Table 6.10: Extrapolated times and measured speedup.

                      Cap                          TRA
Nodes                 4        12        24        4        12       24
SW (s)                622.18   1859.55   3730.50   413.12   986.76   1869.39
HW-SW (s)             45.95    120.32    232.21    111.85   113.36   113.59
Speedup (x times)     13.5     15.5      16.1      3.7      8.9      17.4

As we have 20 Latin Hypercubes, for each grouped path we now return a vector of 20 NPVs. In this way the software only carries out a minimal part of the Monte Carlo simulation, the accumulation of the results. The speedup results obtained correspond exactly to the ones obtained when the accelerator is used without product valuation. This result was expected once we analyzed the results obtained from the accelerator without the product valuation for this cap case study. The FPGA execution time was clearly the dominant time, and thereby reducing the Monte Carlo software execution time does not improve the whole speedup. We can only highlight that the measured wait time almost matches the theoretical FPGA execution time, as the software Monte Carlo time was so small that we could not measure it (we are using timers with microsecond precision).

Table 6.11: Resources of the Different Xilinx FPGA Families.

              Slices     DSP      BRAM
V5-FX200      30,720     384      456
V6-SX550      89,520     864      1,264
V7-VX1140     178,000    1,880    3,360

6.5.3.2. Results Extrapolation Following FPGA’s Roadmap

The FPGA that was chosen for prototyping this research work can only hold one LMM engine, as one engine requires more than 80% of all available resources. However, for newer FPGAs this circumstance changes. If we follow the technological roadmap of the Virtex family from Xilinx, two FPGA families have appeared after Virtex 5: Virtex 6 [Xild] and Virtex 7 [Xile]. Studying the different FPGA families, the Virtex 5 FX200 used in this work corresponds to an SX550 in the Virtex 6 family and to a VX1140 in the Virtex 7 family. In Table 6.11 the comparison of resources among the three FPGA families is summarized. As can be seen, with each new generation the FPGA resources have roughly doubled with the use of smaller submicron technologies (65 nm for Virtex 5, 40 nm for Virtex 6 and 28 nm for Virtex 7). Additionally, the capabilities of the Slices and the DSPs have also been increased (e.g. Virtex 5 slices are made of four LUTs and four Flip-Flops, while in Virtex 6 and 7 the Flip-Flops per slice have been increased to eight). In our case, this resource increase implies two circumstances that could enhance the speedup of our accelerator. In the first place, we could increase the application parallelism by implementing more than one LMM Engine in the same FPGA and, abstracting from the changes required in the communications infrastructure, we can assume that there would be no bottleneck in the PCI Express communications. In this way, several independent interest rate curves would be computed in parallel and more than one LIBOR per cycle (from different curves) could be obtained, with the corresponding improvement in the system throughput.

Table 6.12: Advanced FPGAs extrapolation.

                             Slices          LUTs            Flip-Flops     DSP           BRAM         MHz
V6-SX550 (Synthesis)         -               86871 (24.3%)   23903 (3.3%)   144 (16.7%)   144 (7.7%)   80.9
V6-SX550 (Place & Route)     28176 (31.5%)   81590 (22.8%)   25909 (3.6%)   131 (15.2%)   144 (7.7%)   61.0
V7-VX1140 (Synthesis)        -               87085 (12.2%)   23885 (1.7%)   144 (7.7%)    144 (4.3%)   85.6

This solution would affect the cases where the FPGA accelerator is the slowest part in the parallel computation of the interest curves and the product valuation, as is the case for the Cap product. However, for the TRA case no speedup improvement would be observed. In the second place, improvements in the Engine working frequency could be expected due to the use of smaller submicron technologies. Again, these improvements would only affect the Cap product case. In Table 6.12 the implementation results for one LMM Engine core are summarized for the V6-SX550 and the V7-VX1140. As can be seen in the Post Place & Route results for the V6-SX550, one core now requires less than a third of the FPGA resources while the working frequency is improved by 22% with respect to the 50 MHz in Virtex 5. Hence, if these metrics do not worsen when new LMM cores are introduced, we could instantiate up to three cores in the V6-SX550 FPGA with an improvement of 11 MHz in the clock frequency. To study the impact of these improvements we can focus on the 24-node Cap results. In the last column of Table 6.6 the FPGA time required was 222.5 seconds, which would now be reduced to 60.79 seconds, saving 161.71 seconds; the calculation below works this example out. If we introduce these savings in the experimental measures of Table 6.10, the HW-SW time decreases to 67.56 seconds and the speedup increases up to 52.9. For Virtex 7 this speedup will continue increasing: as can be seen from the synthesis results, up to eight LMM Engines could be implemented in just one V7-VX1140. Given that in Post Place & Route the slice usage is always bigger than that corresponding to the maximum of LUTs and Flip-Flops, we could consider six LMM Engines. In this case (and not considering the additional improvement in clock frequency, four MHz in the synthesis results with respect to Virtex 6) the FPGA time would be reduced to 30.39 seconds (and thereby the simulation would continue to be limited by the FPGA), obtaining a speedup of 93.01.
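As a worked instance of this extrapolation, using only the figures already given above (222.5 s for one 50 MHz engine, three engines at 61 MHz in the V6-SX550):

\[
T_{FPGA}^{V6} \approx \frac{222.5\,\mathrm{s}}{3} \cdot \frac{50\,\mathrm{MHz}}{61\,\mathrm{MHz}} \approx 60.8\,\mathrm{s},
\qquad
\Delta T \approx 222.5\,\mathrm{s} - 60.8\,\mathrm{s} \approx 161.7\,\mathrm{s}.
\]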

6.6. Conclusions

Integrating any hardware accelerator within a software application is a complex and delicate task that can lead to losing most of the achievable speedup when the accelerator is used. In this chapter we have studied the partitioning policy followed for the accelerator implementation, focusing on the FPGA's advantages and disadvantages. We have opted for a more reliable approach, although some performance is sacrificed, leaving the product valuation in software. Following the selected partitioning policy, we have developed the infrastructure (both hardware and software) required to make possible the integration of our accelerator within a software application. A mechanism based on the use of two RAM memory zones and a PCI-E core with Bus Master capabilities in the FPGA has been proposed and implemented. It has allowed us to extend the intrinsic parallelism of Monte Carlo simulations to the real way the CPU and the FPGA work together. In this way, we exploit the CPU to work in parallel with the FPGA, overlapping their execution times. Hence, the software execution time affecting the performance is reduced to the initial and final processing and to the product valuation in case it is slower than the LMM plus RNG in the FPGA. With this scheme we have achieved high speedups, around 18 times, close to the theoretical limit for our cases: when there is no software ported to hardware or when its execution is overlapped by the FPGA execution (the LMM plus RNG achievable speedup). In this case, the achieved speedup could be considerably improved by using new FPGAs and several LMM cores in parallel. When the speedup is not limited by the software running in parallel, the achieved speedup is limited by the working clock frequency of the LMM engine, as no significant communications overhead has been measured. This frequency is limited by the drift calculation, and improving this critical path would improve the global speedup. Additionally, we can also exploit new FPGAs to implement several Engines in parallel, increasing the number of LIBORs computed per cycle (from different curves) and, in this way, the global speedup.

7

Conclusions

The main topic of this Thesis is hardware acceleration with FPGAs. Nowadays, high performance computing is dominated by conventional multicore computers and clusters built with them. However, acceleration continues to be a great necessity. FPGAs are positioned as ideal candidates to provide acceleration thanks to their huge improvement in the last years and their expected technology roadmap. However, dealing with hardware acceleration and FPGAs presents plenty of difficulties that make developing an FPGA hardware accelerator a highly challenging task. This is an extremely wide topic that makes it impossible to study in depth in just one Thesis. On the one hand, there exists a great variety of applications suited for hardware acceleration and they are very diverse. On the other hand, hardware acceleration involves plenty of different and complex tasks. To accomplish an in-depth and as general as possible research and analysis of hardware acceleration with FPGAs, we have focused on a complex subset of applications, Monte Carlo simulations, and in particular on financial Monte Carlo simulations. These simulations present many features that make them ideal candidates for our purpose: they involve complex mathematical operations, rely on specific tasks such as random number generation, require complex control and integration with software applications, need the adaptation of software algorithms to specific hardware ones, require hardware-software partitioning, etc. In this Thesis we have opted to face all these tasks from a practical point of view. In the first steps of this work we have focused on the basic building blocks, and then we have carried out the whole design and implementation cycle of a complete hardware accelerator for one of the selected subset of applications, the LIBOR Market Model. The design and implementation of a complete LMM accelerator, a big challenge in itself, has allowed us to explore and study many other smaller FPGA challenges: due to its complexity, this model is a perfect benchmark, as it comprises specific Monte Carlo and financial design challenges (random number generation, complex equation controls, data feedbacks, etc.) while sharing many elements and features with many other applications (floating-point operators, required accuracy, methods and techniques for FPGAs, hardware-software integration, etc.).

7.1. Contributions and Conclusions of this Thesis

In Chapter 1 we introduced the main objectives of this Thesis and the design challenges that have to be tackled when developing a hardware accelerator. In the following, the main contributions of this Thesis are summarized in relation to the objectives we highlighted.

7.1.1. Random Number Generators

The first objective we mentioned was to study the common elements that play a key role in Monte Carlo simulations and in our target application. One of the most important problems that FPGA designers have to deal with is the lack of good-quality and fully characterized cores that can be reused, allowing the designer to abstract from the particular implementation of all the cores required by an FPGA accelerator. Thereby, we have first focused on the key core for any Monte Carlo simulation, Random Number Generators, and on four main points: uniform and Gaussian random number generation, the inversion method with quintic Hermite interpolation for transforming uniform samples into samples of the target distribution, and variance reduction techniques. With these four elements we have developed a complete Gaussian RNG which is parameterizable and integrates variance reduction techniques. Our contributions in the field are:

• A high-performance Mersenne Twister uniform RNG. We have developed a MT URNG that outperforms previous works and implements the whole algorithm in an FPGA.

• A new FPGA-oriented architecture for the MT URNG based on a circular buffer of shift registers.

• A high-quality FPGA spline approximation of the Gaussian ICDF. We have developed methods and techniques (such as the non-multicycle search algorithm) to make possible a high-quality quintic Hermite interpolation of the ICDF specially suited for hardware acceleration. With our approximation we can see that high-quality generators based on complex mathematical equations are reliable when implemented in FPGAs.

• The developed architecture for this Quintic Hermite interpolation can be reused for any other distribution approximation.

• A parameterizable variance reduction core with Latin Hypercube. We have studied and implemented a Latin Hypercube core for uniform samples that can be integrated within random number generators.

• A high-quality, high-performance Gaussian RNG which is compatible with variance reduction techniques. All previous elements are integrated together to implement a Gaussian RNG based on the inversion method. Additionally, it can be parameterized to modify the mean and the standard deviation and even be converted into a Log-Normal RNG.

These contributions can be seen as a library of elements for Monte Carlo random number generation with an FPGA. They have been developed with two goals, reusability and performance. Hence, their use is not restricted to just our target application, and they can be considered independent cores that can be used as black boxes in any implementation. Regarding performance, the results show the important speedup factors (up to 24 times) that can be achieved with hardware RNGs, even for just uniform RNG. Thus, RNGs are ideal candidates for hardware acceleration, and they can be used in two ways: by themselves or integrated in a more complex accelerator.

7.1.2. Floating-Point Arithmetic Operators and FPGAs

The second key element for most Monte Carlo applications (and for many other applications) is the mathematical operators, as they are the basic elements needed to implement the equations of the models. Our objective in this field was to study how the format can be adapted in FPGAs to reduce its complexity, requiring fewer resources per operator while enhancing performance. For this study we have developed four libraries with different features related to the format (different sets of format simplifications). The contributions of this study are the following:

• An in-depth study and analysis of the impact on floating-point operators of different design decisions on the format. This analysis includes, apart from the implementation results, an analysis of the impact on the accuracy and resolution of the different features with respect to the floating-point standard.

• A set of four single-precision, high-performance libraries composed of six operators: addition, multiplication, division, square root, exponential function and logarithm function.

• An analysis of which design decisions have the best performance-accuracy-standard trade-offs and how these design decisions have to be implemented to have the minimum standard-accuracy impact. The design decisions evaluated are: handling denormalized numbers as zeros, allowing only truncation rounding, and using a hardware representation for the type of number.

• Two final libraries following this last analysis which are specifically suited for FPGAs and high performance. These two libraries include the hardware representation for the type of number and denormalized numbers handled as zeros, and only implement round to nearest.

From the study we can conclude that the format overhead has a major impact on resources and performance, and reducing it is a must to obtain operators suited for FPGAs. In particular, the handling of denormalized numbers has a major impact on the FPGA operators, while the use of the hardware representation also saves logic with no associated trade-off. A second objective in this field was to study the capabilities of current FPGAs to implement datapaths involving many operators. We have carried out a second analysis to identify the number of operators that can be implemented in an FPGA and to identify how the performance of the operators is affected when they are not isolated. Our contribution in this field is the following:

• A theoretical and experimental study of replicability. We have analyzed the number of operators of each type that can be implemented in a target FPGA while characterizing how the introduction of more operators affects the performance and the resource usage.

This study shows the huge capabilities of current FPGAs, which allow up to hundreds of single-precision floating-point operators. However, it also demonstrates how the working frequency of the operators is severely affected by the routing of their elements when the operators are not isolated and a high percentage of the resources of an FPGA is used. Finally, we have faced a third objective in the field of floating-point operators and FPGAs: the development of an accurate exponentiation function taking advantage of FPGA flexibility. The availability of elementary functions for FPGAs is essential for developing hardware co-processors. We have identified the exponentiation function as a complex operator required in some applications but without a complete implementation in the previous literature. In this area, we can summarize the contributions in the following points:

• A study of the accuracy effects of the straightforward implementation of the exponentiation function. A general error analysis has been carried out to study the error propagation within the sub-operators of this implementation.

• An architecture for accurate exponentiation functions on FPGAs based on three chained sub-operators and an exception unit to implement the three different functions provided by the standard for the exponentiation function. Tailored precision for each sub-operator is a key point in this architecture, with a specific error analysis related to the features of each sub-operator.

• Integration of the developed architecture within the FloPoCo tool. The developed architecture has been integrated into this tool to automate the generation of exponentiation operators with variable precisions.

With this work we have seen that very complex operators such as the exponentiation can be implemented in an FPGA, although requiring a high amount of resources and long pipelined datapaths. Moreover, FPGA flexibility allows tailored implementations to overcome the problem of error propagation within the exponentiation function.

7.1.3. LMM Hardware Accelerator

One of the research objectives of this Thesis has been oriented to determining whether FPGAs are capable accelerators for Monte Carlo simulations and, if that is the case, whether they are good accelerators for these particular applications. A complex financial simulation has been selected for this purpose, the LIBOR Market Model. Its complexity (complex equations, high accuracy requirements, complex control, data feedbacks, etc.) has allowed us to face many issues related to hardware acceleration through the design and implementation of an FPGA core for this model. The main contributions in this area are the following:

• Analysis of the selected application from the perspective of an FPGA implementation and hardware acceleration. We have analyzed the equations of the model to identify how it can be implemented in an FPGA, which elements imply restrictions, and finally how the model has to be adapted to an FPGA.

• Design of a complete hardware accelerator for the LIBOR Market Model. This implementation is based on the previous model analysis and a complex control to ensure the maximum possible performance. In this design, all the previously developed cores have been integrated.

• Adaptation and integration of the elementary elements for a real application. The previously designed cores have been adapted to the specific requirements of the target application.

• Study of the tailored precision required by the implementation to ensure highly accurate results. An experimental methodology has been followed to determine which precision fulfills the accuracy requirements.

A complete hardware accelerator has been designed, implemented and integrated within a real application, proving the feasibility of using FPGAs for hardware acceleration in the studied applications. To obtain valid conclusions, we have opted for a real application without any simplifications, dealing with all the problems and constraints of a complex model. Finally, we have validated that FPGAs are capable of implementing complex applications such as financial simulations.

7.1.4. Capacity and Performance of FPGAs. Accelerator Design

When this Thesis started we had several main concerns about FPGA capacities and the selected model: will an FPGA be able to implement a complete complex model such as the LMM? Will it be possible to migrate all the model components to a hardware accelerator? If so, what will the speedup results be? These issues have been resolved, on the one hand thanks to the technological improvements that FPGAs have experienced, as the implementation of the accelerator would not have been possible without the upgrade to an FPGA family that did not exist when the Thesis started. On the other hand, the whole application design has been carried out to minimize the required resources while improving the performance, and an exhaustive analysis of methods and techniques has allowed us to implement all the required cores. Following these ideas, on the side of performance and resources we find the use of extended-precision floating-point operators and the precision-accuracy study, how the LMM model has been adapted to FPGAs, and the whole parallelization of the hardware oriented to obtaining one LIBOR computed per cycle. On the side of methods and techniques, we can highlight the Gaussian RNG, the exponentiation operator and the whole integration of the different cores. Summing up, we have achieved an accelerator that fits in a Virtex 5 FX200 and whose LMM Engine working clock frequency is high enough to achieve a considerable speedup of the migrated tasks, although the model data dependencies make it impossible to use deeply pipelined operators.

7.1.5. Hardware-Software Co-design and Integration

One last contribution of this Thesis has been the final integration and test of the accelerator within a real software application and the analysis of its implications. Practical issues like this are not usually studied at the research stage; however, the design of any accelerator is incomplete if the tasks related to its integration within a software application are not carried out. Additionally, this integration can have a major impact on the real results achieved and is required for a complete validation of the developed hardware. In our case, we had to develop almost all the communications infrastructure, and an adaptation of the software has been required for the correct use of the accelerator within the software application. Our main contributions are the following:

• We have analyzed how the target application has to be split between hardware and software. A partitioning policy based on task stabilities has been followed.

• We have developed the infrastructure (both hardware and software) required to make possible the integration of our accelerator within a software application. A mechanism based on the use of two RAM memory zones and a PCI-E core with Bus Master capabilities in the FPGA has been proposed and implemented. It has allowed us to extend the intrinsic parallelism of Monte Carlo simulations to how the CPU and the FPGA work together.

• A general hardware infrastructure has been developed to isolate the application engine from the communication core based on the use of an interface adapter. This allows a general communications infrastructure to be developed and reused for other applications.

• We have carried out an in-depth profiling of the application, both for the software and the hardware-software implementations.

• We have carried out an extrapolation of the results obtained to newer FPGAs.

The implemented solution has allowed us to exploit the CPU in parallel with the FPGA, overlapping their execution times. In this way, and for our benchmark application, the software execution time that affects overall performance is reduced to the initial and final processing plus the product valuation, in case the latter is slower than the LMM plus RNG computation in the FPGA. With this scheme we have achieved high speedups, around 18 times. This speedup is mainly determined by the data feedbacks in the model; however, it can be considerably improved by using newer FPGAs and by parallelizing LMM cores within the same FPGA.
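The CPU-FPGA overlap can be summarized with the following minimal C++ sketch of a ping-pong (two-buffer) scheme: while the FPGA fills one buffer, the CPU values the batch already delivered in the other one. The functions standing in for the accelerator interface and for the product valuation are placeholder stubs, not the actual driver API or valuation code developed in this Thesis.

    #include <algorithm>
    #include <cstddef>
    #include <numeric>
    #include <vector>

    using Batch = std::vector<float>;

    // Placeholder stubs so the sketch compiles; they stand in for the real
    // Bus Master DMA interface and the real product valuation.
    void fpga_start_batch(Batch& dst) { std::fill(dst.begin(), dst.end(), 1.0f); }
    void fpga_wait_done() {}
    double cpu_value_product(const Batch& b) { return std::accumulate(b.begin(), b.end(), 0.0); }

    double simulate(int num_batches, std::size_t batch_len) {
        Batch buf[2] = {Batch(batch_len), Batch(batch_len)};
        double payoff_sum = 0.0;

        fpga_start_batch(buf[0]);                         // prime the first batch
        for (int b = 0; b < num_batches; ++b) {
            fpga_wait_done();                             // batch b is now in buf[b & 1]
            if (b + 1 < num_batches)
                fpga_start_batch(buf[(b + 1) & 1]);       // FPGA fills the other buffer...
            payoff_sum += cpu_value_product(buf[b & 1]);  // ...while the CPU values this one
        }
        return num_batches > 0 ? payoff_sum / num_batches : 0.0;
    }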

7.2. Future Lines of Work

The extent of the work presented in this Thesis has only allowed us to develop what we consider to be a first version of everything carried out here. This means we have identified plenty of action points that can be addressed, including both improvements and new research lines.

7.2.1. Research Lines Related to Improvements

During the development of the Thesis we have identified several improvements that would enhance both the performance and the quality of the accelerator:

• LMM engine working frequency.
• Gaussian ICDF precision.

Firstly, the achievable speedup is clearly limited by the LMM engine working frequency, which in turn is determined by the data dependencies found in the implemented model, particularly in the accumulator adders. To improve it, two research lines can be followed. The first option is to redesign the whole LMM Core to make the pipelining of the accumulators possible. This could be achieved if the simulation order is changed so that the same LIBOR is computed for all the subpaths before continuing with the next LIBOR, see Section 5.3.1. However, it would imply deep changes in the control and in the architecture, and it would make the adder pipelining dependent on the number of Latin Hypercubes.

The second option is to redesign the accumulator to be faster. One possibility is to carry out a complete profile of the precision and accuracy required by the accumulators, in order to enable the use of a faster and simpler arithmetic, such as fixed-point, just for those operations.

The second main improvement concerns the precision of the ICDF of the Gaussian generator. This generator provides samples with floating-point precision. Extending that precision may open new research lines due to the difficulty of interpolating the ICDF in the tails.
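The reordering idea behind the first option can be illustrated with the following minimal C++ sketch: keeping one partial sum per sub-path means that consecutive additions target different accumulators, so adjacent operations become independent. This is only a behavioural illustration of why the adder pipeline could then accept a new term every cycle, not a description of the actual hardware.

    #include <cstddef>
    #include <vector>

    // One partial sum per sub-path; terms are assumed to arrive interleaved,
    // i.e. term i belongs to sub-path (i % num_subpaths).
    std::vector<float> interleaved_sums(const std::vector<float>& term,
                                        std::size_t num_subpaths) {
        std::vector<float> acc(num_subpaths, 0.0f);
        for (std::size_t i = 0; i < term.size(); ++i) {
            // Successive iterations touch different accumulators, so there is no
            // dependency between adjacent additions and a pipelined adder could
            // accept one of them per cycle.
            acc[i % num_subpaths] += term[i];
        }
        return acc;
    }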

7.2.2. New Research Lines

In addition to these improvements, new research lines can be pursued, building both on the work carried out and on how quickly the related state of the art advances:

• Product Valuation core.
• New FPGAs with new capabilities.
• Integration of the developed cores within design tools to facilitate their reuse.
• New issues related to the LMM.
• Library of random generators based on Quintic Hermite Interpolation.

The first one could be to study in depth sets of different product valuations in order to analyze the feasibility of developing a modular architecture with a pool of computational resources and submodules which, combined with each other, can lead to the implementation of different products.

Second, the continuous advances of FPGAs make new solutions possible. As seen in the results section of Chapter 6, the complete core requires 88% of the resources of the prototyping FPGA. However, nowadays we can find new FPGA families, such as Virtex 6 and Virtex 7, which offer many more resources and higher performance. As we saw in Section 6.5.3.2, the speedup achieved could be improved by the intrinsic speed improvement of these FPGAs and, mainly, by the implementation of LMM cores in parallel within the same FPGA. This opens new research lines to explore the use of several LMM engines in the same FPGA, sharing common resources between them, and to develop a communications infrastructure able to work with several cores.

Third, as mentioned above in the contributions, we have developed many cores that can be directly used or easily adapted to other applications. Another research line could be the integration of these cores within existing CAD tools, or the development of a new CAD tool oriented to the use of these (or other) cores and focused on abstracting away their specific implementation.

Another research line is to continue exploring the use of the LMM for other purposes. In this Thesis we have focused only on the generation of the interest rate curves and on the product valuation. However, there are other tasks related to the Monte Carlo LMM that we have not explored, such as the computation of the Greeks: the sensitivities of the price of derivatives to changes in the underlying parameters on which the value of an instrument or a portfolio of financial instruments depends.
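As a purely illustrative sketch of what such a task involves, the following C++ code shows one standard way of estimating a first-order sensitivity, a bump-and-revalue central difference re-using the same random draws (common random numbers). The toy valuation function is a placeholder only and does not correspond to the LMM engine or to any method proposed in this Thesis.

    #include <algorithm>
    #include <cmath>
    #include <vector>

    // Toy one-step valuation standing in for a full Monte Carlo run; its only
    // purpose is that both calls below re-use the same vector of normal draws.
    double price_with(double initial_rate, const std::vector<double>& normals) {
        const double vol = 0.2, strike = 0.03;
        double sum = 0.0;
        for (double z : normals)
            sum += std::max(initial_rate * std::exp(vol * z - 0.5 * vol * vol) - strike, 0.0);
        return normals.empty() ? 0.0 : sum / normals.size();
    }

    // Central finite difference with common random numbers (bump-and-revalue).
    double estimate_delta(double initial_rate, double bump,
                          const std::vector<double>& normals) {
        double up   = price_with(initial_rate + bump, normals);  // bumped up, same draws
        double down = price_with(initial_rate - bump, normals);  // bumped down, same draws
        return (up - down) / (2.0 * bump);
    }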

Finally, one last research line is to extend the architecture and the methods used to implement the Gaussian ICDF to other distributions, by means of quintic Hermite interpolation of the corresponding inverse CDF.
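The core of such an inversion-based generator can be sketched as follows in C++: a uniform sample selects a segment and a degree-5 polynomial is evaluated on its local coordinate. The uniform segmentation and the coefficient table are placeholders assumed to be precomputed offline from the target inverse CDF and its derivatives; they do not reproduce the segmentation of the Gaussian generator developed in this Thesis.

    #include <array>
    #include <cstddef>
    #include <vector>

    // One segment of a piecewise degree-5 (quintic) interpolant of an inverse CDF:
    // value = c[0] + c[1]*t + ... + c[5]*t^5 on the local coordinate t in [0,1).
    struct Segment { std::array<double, 6> c; };

    double icdf_sample(double u, const std::vector<Segment>& table) {
        // Uniform segmentation used here only for simplicity.
        std::size_t idx = static_cast<std::size_t>(u * table.size());
        if (idx >= table.size()) idx = table.size() - 1;
        double t = u * table.size() - static_cast<double>(idx);   // local coordinate
        const std::array<double, 6>& c = table[idx].c;
        // Horner evaluation of the quintic polynomial on the selected segment.
        return ((((c[5] * t + c[4]) * t + c[3]) * t + c[2]) * t + c[1]) * t + c[0];
    }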
