UNIVERSIDAD POLITÉCNICA DE MADRID
ESCUELA TÉCNICA SUPERIOR DE INGENIEROS DE TELECOMUNICACIÓN
HARDWARE ACCELERATION OF MONTE CARLO-BASED SIMULATIONS
TESIS DOCTORAL
PEDRO ECHEVERRÍA ARAMENDI
INGENIERO EN TELECOMUNICACIÓN
2011
DEPARTAMENTO DE INGENIERÍA ELECTRÓNICA
ESCUELA TÉCNICA SUPERIOR DE INGENIEROS DE TELECOMUNICACIÓN
UNIVERSIDAD POLITÉCNICA DE MADRID
PH.D. THESIS
HARDWARE ACCELERATION OF MONTE CARLO-BASED SIMULATIONS
Author: Pedro Echeverría Aramendi
Telecommunication Engineer
Advisor: María Luisa López Vallejo
Profesora Titular del Dpto. de Ingeniería Electrónica
Universidad Politécnica de Madrid
2011
Ph.D. THESIS: Hardware Acceleration of Monte Carlo-Based Simulations
AUTHOR: Pedro Echeverría Aramendi
ADVISOR: María Luisa López Vallejo
El tribunal nombrado por el Magfco. y Excmo. Sr. Rector de la Universidad Politécnica de Madrid el día 21 de noviembre de 2011, para juzgar la Tesis arriba indicada, compuesto por los siguientes doctores:
PRESIDENTE: D. Carlos Alberto López Barrio
VOCALES: D. Javier Díaz Bruguera
D. Florent Dupont de Dinechin
D. Luis Entrena Arrontes
SECRETARIO: D. Carlos Carreras Vaquer
Realizado el acto de lectura y defensa de la Tesis el día 21 de noviembre de 2011 en la E.T.S. de Ingenieros de Telecomunicación, el tribunal acuerda otorgarle la calificación de:
El Secretario del tribunal

A mi familia

Contents
Contents
Abstract
Resumen
Acknowledgments
List of Figures
List of Tables
1 Introduction
  1.1 Motivation
    1.1.1 Acceleration Features of FPGAs
    1.1.2 Applications
    1.1.3 Designing with FPGAs. Challenges
  1.2 Objectives and Thesis Structure
    1.2.1 Monte Carlo Simulations and Target Application: LIBOR Market Model
    1.2.2 Objectives
    1.2.3 PhD Thesis Structure

2 Random Number Generation
  2.1 Random Number Generation: Overall Introduction
  2.2 Uniform Random Number Generation
    2.2.1 Linear Congruential Generators (LCG)
    2.2.2 Combined Generator Rand2
    2.2.3 Tausworthe Generators
    2.2.4 Mersenne Twister
  2.3 N(0,1) Gaussian Random Number Generation
    2.3.1 Generation methods
    2.3.2 Monte Carlo Implications and Hardware Implementation
    2.3.3 Inversion Method with Quintic Hermite Interpolation
  2.4 Variance Reduction Techniques
    2.4.1 Stratified Sampling and Latin Hypercube
  2.5 Developed Gaussian Random Number Generator
    2.5.1 Uniform Random Number Generator
    2.5.2 N(0,1) Gaussian Random Number Generator
    2.5.3 Stratified Sampling and Latin Hypercube
    2.5.4 Complete GRNG and SW-HW comparison
  2.6 Extending N(0,1) RNG
    2.6.1 Parameterisable RNG based on N(0,1) RNG
  2.7 Conclusions

3 Implementing Floating-Point Arithmetic in Configurable Logic
  3.1 Related Works
  3.2 Floating Point Format IEEE 754
    3.2.1 Format Complexity
  3.3 Floating-point Units for FPGAs. Adapting the Format and Standard Compliance
    3.3.1 Simplification of Denormalized Numbers
    3.3.2 Truncation Rounding
    3.3.3 Hardware Representation
    3.3.4 Global Approach Analysis
  3.4 Operators Architecture
    3.4.1 Adder/subtracter
    3.4.2 Multiplication
    3.4.3 Division
    3.4.4 Square Root
    3.4.5 Exponential and Logarithm Units
  3.5 Libraries Evaluation and Comparison
    3.5.1 Comparison with respect to a Commercial Library
    3.5.2 Operators Evaluation
    3.5.3 Replicability
  3.6 Towards Standard Compliance and Performance
    3.6.1 Simplification of Denormalized Numbers: One Bit Exponent Extension
    3.6.2 Truncation Rounding: Mantissa Extension
    3.6.3 FPGA-oriented floating-point library
  3.7 Conclusions

4 Exponentiation Operator
  4.1 Exponentiation function
    4.1.1 Related Work
  4.2 Range and error analysis
    4.2.1 Input-output range analysis
    4.2.2 General Error analysis
    4.2.3 Error Analysis for accurate x^y
  4.3 Variable precision implementation with FloPoCo
    4.3.1 Logarithm
    4.3.2 Multiplier
    4.3.3 Exponential
    4.3.4 Exceptions Unit
  4.4 Experimental Results
    4.4.1 Results Analysis
    4.4.2 Comparison with previous work
    4.4.3 Exceptions Unit
  4.5 Conclusions

5 LIBOR Market Model Hardware Core
  5.1 LIBOR Market Model
    5.1.1 LIBOR Market Model as base to compute financial products
  5.2 Model Analysis
    5.2.1 Variables' Range
    5.2.2 Simplifications to the model: Factorization
    5.2.3 Operators' complexity
    5.2.4 Model Summary
    5.2.5 Qualitative Profiling
    5.2.6 Data dependencies
  5.3 Adapting the model to Hardware
    5.3.1 Simulation order
    5.3.2 Tailored Arithmetic
  5.4 FPGA Monte Carlo Libor Market Model Engine
    5.4.1 General Architecture
    5.4.2 Gaussian RNG Core
    5.4.3 LMM Core
    5.4.4 Product Valuation Core
    5.4.5 Control Unit
  5.5 LMM Engine Implementation
    5.5.1 Operators' Features
    5.5.2 LMM Core. Precision-Accuracy-Performance
    5.5.3 Cores Implementation
  5.6 Conclusions

6 Hardware-Software Integration
  6.1 Hardware-Software Partitioning
    6.1.1 Tasks Stability Characteristics
    6.1.2 Communication overheads
    6.1.3 Achieve maximum possible acceleration
    6.1.4 Partitioning Policy
  6.2 System Architecture and Communications
    6.2.1 Why PCI-Express?
    6.2.2 Communications Model
    6.2.3 Communications Requirements
  6.3 PCI Express Core
    6.3.1 Within FPGA Communications
  6.4 Software
    6.4.1 Driver and Low Level Functions
    6.4.2 Application modification
  6.5 Experimental Results
    6.5.1 Complete Accelerator Implementation Results
    6.5.2 Software Profiling
    6.5.3 Hardware-Software Solution Results
  6.6 Conclusions

7 Conclusions
  7.1 Contributions and Conclusions of this Thesis
    7.1.1 Random Number Generators
    7.1.2 Floating-Point Arithmetic Operators and FPGAs
    7.1.3 LMM Hardware Accelerator
    7.1.4 Capacity and Performance of FPGAs. Accelerator design
    7.1.5 Hardware-Software co-design and Integration
  7.2 Future Lines of Work
    7.2.1 Research lines related to Improvements
    7.2.2 New Research Lines

Bibliography
Abstract
In recent years there has been enormous progress in FPGAs. Traditionally, FPGAs have been used mainly for prototyping, as they offer significant advantages at a suitably low cost: flexibility and ease of verification. Their flexibility allows the implementation of different generations of a given application and gives designers room to modify implementations until the very last moment, or even to correct mistakes once the product has been released. In addition, the verification of a design mapped into an FPGA is easier and simpler than for ASICs, which require a huge verification effort.

Beyond these advantages, technological advances have added great capability and performance to FPGAs; even though FPGAs are not as efficient as ASICs in terms of performance, area or power, nowadays they can provide better performance than standard or digital signal processor (DSP) based systems. This fact, in conjunction with the enormous logic capacity allowed by today's technologies, makes FPGAs an attractive choice for the implementation of complex digital systems. Furthermore, with their newly acquired digital signal processing capabilities, FPGAs are now expanding their traditional prototyping role to help offload computationally intensive functions from standard processors.

This Thesis is focused on the last point, the use of FPGAs to accelerate computationally intensive applications. The use of FPGAs for hardware acceleration is an active research field. However, there are still several challenges concerning the use of FPGAs as accelerators:
• Availability of Cores.
• Capability and performance of FPGAs.
• Methods, algorithms and techniques suited for FPGAs.
• Design tools.
• Hardware-Software co-design and integration.

Studying each one of these five challenges in depth is not feasible in just one Thesis. The great variety of applications that can be accelerated, and the different features among them, imply that the complexity of each task is high. Therefore, in this Thesis we have chosen one subset of applications to be studied, dealing with the implementation of a real application of this subset. Selecting a complex subset of applications, in our case Monte Carlo simulations, allows us to make a general analysis of the main topic, hardware acceleration, from the study, analysis and design of a particular application, since this subset shares several features with many other applications. Specifically, we have selected a financial application, the Monte Carlo based LIBOR Market Model.

Developing an FPGA application from scratch is almost impossible, and the availability of cores is a must to shorten development time. Following this idea, one of the main objectives is to study the common elements that play a key role in Monte Carlo simulations and in our target application (elements shared with many other applications). Two common elements stand out:

• The random number generators that are required for the underlying random variables.
• Floating-point operators, which are the base elements for implementing the mathematical models that are evaluated.

Accordingly, the first objective of this Ph.D. Thesis is the study, design and implementation of random number generators. In particular, we have focused on Gaussian random number generation and the implementation of a complete generator, compatible with variance reduction techniques, that can be used for our target application and for other applications.

In this field we have developed a high-quality, high-performance Gaussian random number generator which is parameterizable and compatible with the also developed parameterizable Latin Hypercube core and with a high-performance Mersenne Twister generator. Research results in this field demonstrate that random number generation is ideal for hardware acceleration, either as an isolated core or within bigger accelerators.

Meanwhile, the second objective has dealt with the implementation of efficient, FPGA-oriented mathematical operators (both basic and complex, using floating-point arithmetic). We focused on the design, development and characterization of libraries of components. Instead of focusing on the algorithms of the operators, our approach has been to study how the format can be simplified to obtain operators that are better suited for FPGAs and present better performance. An important goal here was to achieve libraries of general purpose components that can be reused in several applications and not just in a particular target application.
Different design decisions have been studied and analyzed, and from this analysis the impact of the overhead due to some of the floating-point standard features has been determined. The format overhead implies greater resource usage, and reducing it is a must to obtain operators, independently of the underlying calculation algorithm, that are better suited for FPGAs and present better performance. In particular, the handling of denormalized numbers has a major impact on FPGA operators. Following the results obtained in that study, we have discussed and selected a set of features that implies improved performance and reduced resources. This set has been chosen to design two additional FPGA-oriented hardware libraries that ensure (or even improve) the accuracy and resolution given by the standard. The operators of these libraries are the base components for the implementation of the target application.

Additionally, a second analysis has been carried out to study the capabilities of FPGAs to implement complex datapaths. This analysis shows the huge capacity of current FPGAs, which allows up to hundreds of floating-point operators. Despite this capacity, this second analysis has also demonstrated how the working frequency of the operators is severely affected by the routing of their elements when the operators are not isolated and a high percentage of the resources of an FPGA is used.

Related to the target application, a third objective of this work was to study in depth the implementation of a particular operator, the exponentiation function. This operator is required in many scientific and financial simulations. Its complexity and the lack of previous general purpose implementations deserved special attention. We have developed and presented an accurate exponentiation operator for FPGAs based on the straightforward translation of x^y into a chain of sub-operators and on the FPGA flexibility that allows tailored precisions. Taking advantage of this flexibility, the provided error analysis focused on determining which precisions are needed in the partial results and in the internal architectures of the sub-operators to obtain an accurate operator with a maximum error of one ulp. Finally, the integration of this error analysis and the development of the operator within the FloPoCo project have made it possible to automate the generation of exponentiation operators with variable precisions.

The next objective we tackled was related to the global purpose of the Thesis: validating all the previously developed elements through the implementation of a complex Monte Carlo simulation which involves all the features that can be found in Monte Carlo simulations. In this way, we have dealt with the implementation of the target application, the LIBOR Market Model (LMM). Special attention was devoted to all the features, requirements and circumstances that affect the performance of the accelerator. A complete LMM hardware core has been developed and its results validated against the original software implementation. Three main features were analyzed:
• Correctness of the results obtained.
• Accuracy.
• Speedup factors obtained by the global application and by each of the main components.
Finally, the last objective was the integration of the hardware accelerator within the original software application. All issues related to the communication mechanism are studied, with special focus on how performance is affected by data transfers and by the hardware-software partitioning policy implemented.

Following the selected partitioning policy, we have developed the infrastructure (both hardware and software) required to make the integration of our accelerator within a software application possible. A mechanism based on the use of two RAM memory zones and a PCI-E core with Bus Master capabilities in the FPGA has been proposed and implemented. It has allowed us to extend the intrinsic parallelism of Monte Carlo simulations to the way the CPU and the FPGA work together. In this way, we exploit the CPU to work in parallel with the FPGA, overlapping their execution times. Hence, the software execution time affecting performance is reduced to the initial and final processing, and to the product valuation in case it is slower than the LMM plus the random number generator in the FPGA. With this scheme we have achieved high speedups, around 18 times, close to the theoretical limit for our case: the speedup achievable by the LMM plus the RNG when all remaining software either has been ported to hardware or has its execution overlapped with the FPGA execution. Moreover, the speedup achieved could be considerably improved using newer FPGAs and several LMM cores in parallel.
Resumen
Durante los últimos años ha habido un enorme avance en la tecnología y capacidades de las FPGAs. Tradicionalmente, las FPGAs se han utilizado principalmente para el desarrollo de prototipos, ya que ofrecen importantes ventajas a un bajo coste: flexibilidad y facilidad de verificación. Su flexibilidad permite la implementación de las diferentes versiones de una aplicación determinada y permite a los diseñadores modificar las implementaciones hasta el último momento, o incluso corregir errores una vez que el producto está siendo utilizado. En segundo lugar, la verificación de un diseño en una FPGA es más fácil y sencilla que en un ASIC, que requiere un esfuerzo de verificación enorme.

Además de estas ventajas, los avances tecnológicos han permitido FPGAs con grandes capacidades a la vez que se ha aumentado su rendimiento. Y aunque las FPGAs no sean tan eficientes como los ASIC en términos de rendimiento, recursos o consumo de potencia, hoy en día pueden ofrecer un mejor rendimiento que un sistema estándar o que uno basado en procesadores digitales de señal (DSP). Esto, junto con la enorme capacidad de recursos lógicos alcanzada por las tecnologías de hoy, hace de las FPGAs una opción atractiva para la implementación de sistemas digitales complejos. Además, con su recientemente adquirida capacidad de procesamiento de señal digital, las FPGAs están ampliando su rol tradicional de prototipado al rol de coprocesador para descargar de cálculos intensivos a los procesadores estándar.

Esta tesis se centra en el último punto, el uso de FPGAs para acelerar las aplicaciones computacionalmente intensivas. El uso de FPGAs para la aceleración hardware es un área activa de investigación. Sin embargo, todavía hay varios desafíos relativos al uso de FPGAs como aceleradores:

• Disponibilidad de cores de implementación.
• Capacidad y rendimiento de las FPGAs.
• Necesidad de métodos, algoritmos y técnicas adecuadas para FPGAs.
• Herramientas de diseño.
• Co-diseño Hardware-Software y su integración.

El estudio detallado de cada uno de estos cinco desafíos relacionados con la aceleración hardware no es factible en tan sólo una tesis. La gran variedad de aplicaciones que pueden ser aceleradas, y las diferentes características entre ellas, implica que la complejidad de cada tarea es alta. Por lo tanto, en esta tesis se ha elegido un conjunto de aplicaciones a estudiar, y se ha llevado a cabo la implementación de una aplicación real de este subgrupo.

La selección de un subconjunto de aplicaciones complejas, en nuestro caso las simulaciones Monte Carlo, nos permite hacer un análisis general de la aceleración hardware, nuestro campo principal, desde el estudio, análisis y diseño de una aplicación en particular, ya que este conjunto de aplicaciones tiene varias características compartidas con muchas otras aplicaciones. En concreto, hemos seleccionado una aplicación financiera, la simulación del LIBOR Market Model basada en Monte Carlo.

El desarrollo de aplicaciones en FPGAs a partir de cero es casi imposible y la disponibilidad de cores es una necesidad para acortar el tiempo de desarrollo. Siguiendo esta idea, uno de nuestros principales objetivos es el estudio de los elementos comunes que juegan un papel clave en las simulaciones de Monte Carlo y en la aplicación seleccionada (y compartidos con muchas otras aplicaciones). Dos elementos comunes han sido destacados:

• Los generadores de números aleatorios que se requieren para las variables aleatorias subyacentes.
• Los operadores de punto flotante, que son los elementos base para implementar los modelos matemáticos que se evalúan.

De esta manera, el primer objetivo de esta Tesis es el estudio, diseño e implementación de generadores de números aleatorios. En particular, nos hemos centrado en la generación de números aleatorios con distribución gaussiana y en la implementación de un generador completo y compatible con las técnicas de reducción de varianza que se utilizan en la aplicación seleccionada y en otras aplicaciones.

En este campo de investigación hemos desarrollado un generador de números aleatorios gaussianos de alta calidad y alto rendimiento. A su vez, este generador es parametrizable y compatible con el módulo parametrizable de hipercubo latino también desarrollado y con un generador Mersenne Twister de alto rendimiento. Los resultados de investigación en este campo demuestran que la generación de números aleatorios es idónea para la aceleración hardware, tanto como núcleo aislado como integrado en aceleradores mayores.
El segundo objetivo se ha ocupado del desarrollo de operadores matemáticos eficientes y orientados a FPGAs (tanto básicos como complejos, con aritmética de punto flotante). Nos hemos centrado en el diseño, desarrollo y caracterización de librerías de componentes. En lugar de centrarnos en los algoritmos de los operadores, nuestro enfoque ha sido estudiar cómo se puede simplificar el formato para obtener operadores más adecuados para FPGAs y que a su vez presenten un mejor rendimiento. Un objetivo importante aquí ha sido lograr librerías de componentes de propósito general que puedan ser reutilizadas en varias aplicaciones y no sólo en la aplicación seleccionada en esta tesis.

Se han estudiado y analizado diferentes decisiones de diseño. De este análisis, hemos determinado el impacto de la sobrecarga debida a algunas de las características del estándar de punto flotante. La sobrecarga que presenta este formato implica un mayor uso de los recursos, y su reducción es una necesidad para obtener operadores más adecuados para FPGAs y con mejor rendimiento, independientemente del algoritmo de cálculo subyacente. En particular, el manejo de los números denormalizados tiene un gran impacto en los operadores para FPGA. Con los resultados obtenidos en ese estudio, hemos analizado y seleccionado un conjunto de características que implican un mejor rendimiento y una reducción de los recursos. Este conjunto ha sido elegido para diseñar dos librerías adicionales para FPGA orientadas a garantizar (o incluso mejorar) la precisión y la resolución dadas por el estándar. Los operadores de estas librerías son los componentes básicos para la implementación de la aplicación seleccionada.

Además, se ha llevado a cabo un segundo análisis para estudiar las capacidades de las FPGAs para implementar rutas de datos complejas. Este análisis muestra las enormes capacidades de las FPGAs actuales, que permiten la implementación de cientos de operadores de punto flotante en la misma FPGA. A pesar de esta capacidad, este segundo análisis también demuestra cómo la frecuencia de trabajo de los operadores se ve gravemente afectada por el interconexionado de sus elementos cuando los operadores no están aislados y se está utilizando un alto porcentaje de los recursos de la FPGA.

Relacionado con la aplicación seleccionada, un tercer objetivo de este trabajo ha sido profundizar en la implementación de un operador en particular, la función de exponenciación. Este operador se utiliza en muchas simulaciones científicas y financieras. Su complejidad y la falta de implementaciones previas de propósito general han merecido una atención especial. Hemos desarrollado y presentado un operador de exponenciación exacto para FPGAs basado en la traducción directa de x^y en una cadena de sub-operadores y en la flexibilidad de las FPGAs, que permite precisiones a medida. Aprovechando esta flexibilidad, el análisis de error se centró en determinar qué precisiones son necesarias en los resultados parciales y en la arquitectura interna de los sub-operadores para obtener un operador exacto con un error máximo de un ulp. Por último, la integración de este análisis de error y el desarrollo del operador dentro del proyecto FloPoCo han permitido automatizar la generación de operadores de exponenciación con precisiones variables.
El siguiente objetivo ha sido abordar, en relación con el objetivo global de la Tesis, la validación de todos los elementos desarrollados anteriormente mediante la implementación de un modelo complejo de simulación de Monte Carlo que incluye todas las características que se pueden encontrar en este tipo de simulaciones. De esta manera, abordamos la implementación de la aplicación seleccionada, el LIBOR Market Model (LMM). Se prestó especial atención a todas las características, requisitos y circunstancias que afectan al rendimiento del acelerador. Un core completo del LMM ha sido desarrollado en hardware y validado contra los resultados del software original. Se han analizado tres características principales:
• La exactitud de los resultados obtenidos.
• La precisión necesaria para el hardware.
• Los factores de aceleración obtenidos por la aplicación global y por cada uno de los componentes principales.
Finalmente, el último objetivo ha sido la integración del acelerador hardware con la aplicación software original. Todas las cuestiones relacionadas con los mecanismos de comunicación se han estudiado poniendo especial énfasis en cómo el rendimiento se ve afectado por las transferencias de datos y por la política de particionamiento hardware-software implementada.

Siguiendo la política de particionamiento seleccionada, hemos desarrollado la infraestructura (hardware y software) necesaria para hacer posible la integración de nuestro acelerador dentro de una aplicación software. Se ha propuesto e implementado un mecanismo basado en el uso de dos zonas de memoria RAM y un core PCI-E con capacidad de bus master en la FPGA. Este mecanismo nos ha permitido extender el paralelismo intrínseco de las simulaciones de Monte Carlo a la forma en que la CPU y la FPGA trabajan juntas. De esta manera, se aprovecha la CPU para trabajar en paralelo con la FPGA, superponiendo sus tiempos de ejecución. Por lo tanto, el tiempo de ejecución software que afecta al rendimiento se reduce al tratamiento inicial y final y a la valoración del producto, en caso de que ésta sea más lenta que el LMM más el generador de números aleatorios en la FPGA. Con este esquema hemos logrado incrementos de velocidad altos, de alrededor de 18 veces, muy cercanos al límite teórico de nuestro caso: aquel en el que todo el software restante o bien ha sido portado al hardware o bien superpone su ejecución con la de la FPGA (la máxima aceleración alcanzable teniendo en cuenta sólo el LMM más la generación de números aleatorios). La aceleración lograda podría además mejorarse considerablemente con FPGAs más nuevas y con varios núcleos de LMM en paralelo.
Acknowledgments
I would like to thank Marisa, who has been guiding my research since my final degree project and has been working with me ever since. This work is also hers. Thanks for giving me the opportunity to work on this Thesis, for all the advice and contributions that have enriched it, and for these years of collaboration and friendship.

Thanks also to Carlos López Barrio for his support, his advice and for sharing his experience with me. I would also like to thank all the people from the LSI research group for all these years of shared good moments, meals and conversations. Special thanks to Miguel Ángel and Pablo, who have collaborated with me in some research fields and have become good friends.

I would also like to thank my friend Paco, who has helped me in the development of the driver and shared with me many conversations about this Thesis and its research. And Florent, without whom chapter four would not have been possible. Thanks to Pedro and Rocío, who have shared many hours and conversations with me at the office.

I would also like to acknowledge BBVA and its New Products Department for funding and supporting this Thesis through the project P060920579. In particular, I would like to thank Miguel Ángel, Javier, Manuel, José María and Antonio.

To my parents and family, this book is for them. And finally, thanks to my wife Marta, for all her support, love and patience during these years.
List of Figures
1.1 Thesis Structure.

2.1 Plot of 1/√m.
2.2 Inversion Method.
2.3 Dimension Impact: Stratified Sampling & Latin Hypercube.
2.4 Mersenne Twister General FPGA Architecture.
2.5 MT work area storage.
2.6 UNU.RAN segmentation.
2.7 Inversion N(0,1) RNG architecture.
2.8 Search unit.
2.9 Variance reduction general hardware architecture.
2.10 Stratified Sampling Control Unit.
2.11 Latin Hypercube Control Unit.
2.12 Stratified Sampling and Latin Hypercube results.
2.13 Inversion based GRNG with Variance Reduction technique.
2.14 Parameterisable N(µ,σ)-LogN(µ,σ) RNG.

3.1 Floating-Point word.
3.2 Floating-Point Operator.
3.3 Adder-Subtracter.
3.4 Multiplier Architecture.
3.5 Division step.
3.6 Square Root.
3.7 Exponential function unit.
3.8 Logarithm function unit.
3.9 Operators Evaluation.
3.10 Slices and Stages per type for each library.
3.11 Operators Replicability (HP Library).
3.12 Synthetic Datapath.
3.13 Adder Replicability Results.
3.14 Divider Replicability Results.
3.15 Multiplier Replicability Results.
3.16 Logarithm Replicability Results.
3.17 Towards Standard Operators Evaluation.

4.1 Simplified overview of a power function unit.
4.2 Power function architecture.

5.1 LIBOR Forward Rates.
5.2 Monte Carlo LIBORs simulation.
5.3 LMM Monte Carlo Simulation with Latin Hypercube.
5.4 Engine Architecture.
5.5 LMM Core unit.
5.6 Parallel Correlation.
5.7 Sequential Correlation.
5.8 Drift Calculation.
5.9 LIBOR Calculation.
5.10 LMM datapath.
5.11 Product Valuation Core.
5.12 Architecture for LMM accuracy measurement.
5.13 SW-HW difference in average.
5.14 Maximum SW-HW difference.
5.15 SW-HW difference in the last time step.

6.1 Integration Architecture.
6.2 Communications Flow.
6.3 Dataflows.
6.4 PCI-Express Core.
6.5 PCI-Express & Accelerator Interface.
6.6 Detailed-view Software-Hardware modified dataflow.
List of Tables
1.1 Speedups obtained in different case studies.

2.1 Gaussian Generation Methods-Selection Criteria.
2.2 Gaussian ICDF Implementation Requirements.
2.3 Virtex-4 XC4VFX140-11. Table of resources.
2.4 URNG Implementation Results.
2.5 Accuracy and Segmentation.
2.6 Maximum segment size - Number of segments (searched 21 mantissa bits of accuracy).
2.7 FPGA N(0,1) RNG results.
2.8 Variance Reduction implementation Results.
2.9 Complete N(0,1) RNG results.
2.10 Hardware-Software Comparison.
2.11 Parameterisable N(µ,σ)-LogN(µ,σ) RNG results.

3.1 Floating-Point Operators Libraries. Four Basic Operators.
3.2 Types of floating-point numbers.
3.3 Logic Reduction due to Design Simplifications.
3.4 Operators Results. Commercial Library.
3.5 Operators Results.
3.6 Slices type of logic per operator.
3.7 Split Slices comparison.
3.8 Split Pipeline Stages comparison.
3.9 FPGA Resources.
3.10 Operators results with the final proposed features.
3.11 Required Interfaces.

4.1 Exception handling for the exponentiation function in the IEEE-754 standard.
4.2 Sub-operators Range Analysis.
4.3 Powering function Relative Error (ulp).
4.4 Synthesis results for Virtex-4 (4vfx100ff1152-12) for pow function.
4.5 Separate synthesis results for the sub-component (targeting 200MHz).
4.6 Separate synthesis results for Exception unit.
4.7 Exception's control results.

5.1 Parameters Range.
5.2 Implementation Results for the Modified operators.
5.3 Implementation Results of the LMM Engine for V5-FX200.

6.1 Types of Data Transfers.
6.2 Implementation results of the complete accelerator for a V5-FX200.
6.3 Test Simulation Features.
6.4 Software Profiling.
6.5 Main variables to compute per subpath.
6.6 Extrapolated Profile (5000 grouped paths).
6.7 LMM, GRNG and LMM+RNG Speedups.
6.8 Extrapolated achievable speedup.
6.9 Hardware-Software Profiling.
6.10 Extrapolated times and measured speedup.
6.11 Resources of the Different Xilinx FPGA Families.
6.12 Advanced FPGAs extrapolation.
1 Introduction
Since the first digital computers were designed and developed, one of the main necessities boosting research in computer science has been the need for higher performance. There has always been a huge number of applications pushing the limits of computer technologies and demanding more performance: scenarios with real-time requirements, or excessively long execution times that make an application unfeasible.

Meanwhile, never-ending improvements in silicon technologies keep bringing higher and higher performance: deep nanometer technologies (currently 20 nm is almost available [Ete11] and 14 nm is expected in 2012) which allow the integration of billions of transistors in a single die, different types of logic core devices with different oxide thicknesses and threshold voltages to meet the requirements of high-performance, low-standby-power, or low-operating-power circuit applications, etc.

In addition to these continuous improvements, designers have relied on solutions based on special architectures to accelerate the performance of these applications, with processing units exploiting their common features such as parallelism, repetitive tasks or intensive mathematical processing. Traditionally, these solutions have been of two types:
• Parallel processing computers with parallel processors and/or datapaths.
• Dedicated hardware (accelerators), specialized in one type of processing task, which complements conventional architectures.
However, this situation has changed. Computers based on conventional architectures and processors have always been much more competitive in price than parallel processing computers. Therefore, as the technology to interconnect different computers to work as a single big supercomputer has continuously improved, parallel processing computers have been gradually replaced by clusters of conventional computers and multicore processors. Furthermore, as power consumption has become a main concern in recent years [KAB+03], parallel computing has become the dominant paradigm in computer architecture and has been incorporated into conventional architectures, mainly in the form of the above mentioned multicore processors. In this new scenario, dominated by conventional multicore computers and clusters built with them, acceleration continues to be a great necessity due to the following reasons:
• There are applications that cannot be accelerated by a cluster or a multicore architecture:
  – Single-thread applications.
  – Embedded systems.
  – Single computer environments.
• Energy and space required by large clusters are key limiting factors.
• Multicore processors are based on general purpose cores. Consequently, they are not optimal for all types of computations.
Therefore, while the use of specific parallel processing computers has declined, new solutions continue to appear in the field of hardware accelerators. In particular, the use of complementary hardware accelerators is blossoming and becoming more and more important. Focusing on these problems, acceleration can be provided using different technologies, mainly the following three:
• ASICs, Application Specific Integrated Circuits.
• GPUs, General Purpose Graphical Processing Units.
• FPGAs, Field Programmable Gate Arrays.
ASICs share the same technology as general purpose microprocessors, but they are specifically designed for a particular application and are not controlled by software. Therefore, they can be the most efficient technology for any computational task. However, their use for acceleration purposes is very limited. On the one hand, they do not offer any flexibility, as the task they perform cannot be modified. On the other hand, the high cost of designing and manufacturing any integrated circuit restricts their use to applications where millions of circuits can be sold.
In recent years GPUs, which are already accelerators for personal computers that handle all graphics processing, have started to be used to accelerate applications with characteristics similar to graphics processing, due to the significant performance they have achieved. Furthermore, the addition of programmable stages in the GPU datapath has allowed generalizing the use of GPUs, leading to general-purpose computing on graphics processing units, GPGPU [RHS+08]. However, applications with complex feedback loops and control, or with extensive bit handling, are not suitable for GPU implementation. Meanwhile, the high power consumption of GPUs restricts their use in certain environments.

As with GPUs, in recent years there has been an enormous advance in FPGAs. Traditionally, FPGAs have been used mainly for prototyping, as they offer significant advantages at a suitably low cost [Hau98]: flexibility and ease of verification. Their flexibility allows the implementation of different generations of a given application and gives designers room to modify implementations until the very last moment, or even to correct mistakes once the product has been released. Second, the verification of a design mapped into an FPGA is easier and simpler than in ASICs, which require a huge verification effort.

In addition to these advantages, technological advances have added great capability and performance to FPGAs, and even though FPGAs are not as efficient as ASICs in terms of performance, area or power, nowadays they can provide better performance than standard or digital signal processor (DSP) based systems. This fact, in conjunction with the enormous logic capacity allowed by today's technologies, makes FPGAs an attractive choice for the implementation of complex digital systems. Furthermore, with their newly acquired digital signal processing capabilities [Xilf], FPGAs are now expanding their traditional prototyping role to help offload computationally intensive functions from standard processors.
1.1. Motivation
This Thesis is focused on the last point, the use of FPGAs to accelerate computationally intensive applications. Current deep sub-micron technologies allow manufacturing FPGAs with extraordinary logic density and speed. The initial challenges related to FPGA programmability and large interconnection capacitances (poor performance, low logic density and high power dissipation) have been overcome [KR07] while providing attractive low cost and flexibility.

Additionally, nowadays FPGAs provide not only configurable logic. They also provide powerful DSP units (mainly for multiplication and accumulation operations) and embedded RAM memory [Altb, Xild], while high speed elements are also provided for interconnections. Consequently, the use of FPGAs in the implementation of complex applications (see Section 1.1.2) is increasingly common thanks to the set of features that make FPGAs a powerful alternative for acceleration (see Section 1.1.1). However, several challenges and concerns still remain related to hardware acceleration with FPGAs (Section 1.1.3).
1.1.1. Acceleration Features of FPGAs
Even though the clock frequencies that can be achieved with an FPGA are low when compared to a high-end microprocessor, around ten times slower, accelerating applications with FPGAs can potentially deliver enormous performance [HVG+07a] due to the intrinsic nature of the FPGA's architecture, which allows:
• Parallel architectures.
• Cascaded datapaths.
• Deeply pipelined architectures.
When designing an accelerator with an FPGA, two levels of parallelism can be achieved. Firstly, multi-thread parallelism, as the application datapath can be replicated as many times as possible, with the number of resources of the FPGA as the only a priori limit. Secondly, the datapath itself can be a parallel datapath, executing several instructions and operations simultaneously, constrained only by data dependencies.

Furthermore, datapaths in an FPGA can also be implemented as a complete chain, since all the overhead instructions needed in software (related to indexes, memory accesses, conditional loops, etc.) can be carried out in parallel and in advance. Thereby, control instructions and instructions related to moving data do not affect the datapath. Finally, the FPGA's combination of combinational and register logic allows deeply pipelined architectures that process new data every clock cycle, limited only by data dependencies or communications.
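To make these acceleration features concrete, the following C++ sketch models the throughput of such a design: a fully pipelined datapath that accepts one new sample per cycle once its pipeline is filled, replicated several times. All numbers (clock frequency, replica count, pipeline depth, sample count) are hypothetical placeholders chosen for illustration, not measurements from this Thesis.

```cpp
#include <cstdint>
#include <iostream>

// Back-of-the-envelope throughput model for a replicated, fully pipelined
// FPGA datapath: after 'latency' cycles the pipeline is full, and from then
// on each replica delivers one result per clock cycle.
int main() {
    const double clock_hz = 100e6;           // FPGA clock (~10x below a CPU)
    const int replicas    = 8;               // parallel copies of the datapath
    const int latency     = 40;              // pipeline depth in cycles
    const std::uint64_t samples = 10000000;  // Monte Carlo samples to process

    const double cycles  = latency + static_cast<double>(samples) / replicas;
    const double seconds = cycles / clock_hz;
    std::cout << "time: " << seconds << " s, throughput: "
              << samples / seconds / 1e6 << " Msamples/s\n";
}
```

Despite the modest clock frequency, the product of replication and one result per cycle (here 8 x 100 MHz = 800 Msamples/s) is what makes the large speedups reported below possible.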
1.1.2. Applications
Computationally intensive applications can be found in almost every field where a computer is used. Nevertheless, the advances in computer technologies have provided solutions to a great number of these applications, making their execution possible on up-to-date conventional computers. However, there exist other complex applications requiring higher performance than the one provided by conventional computers. Additionally, new and more complex applications continue to appear. These applications can be classified into three interrelated groups:
• High Performance Computing.
• Real time applications.
• Intensive applications in embedded systems.
The range of applications where FPGAs can be used as accelerators includes any application where some degree of parallelism can be exploited or where a large datapath is required. Related research fields include bioinformatics [EH09], where FPGAs are used for DNA sequencing and dot plotting, or molecular dynamics simulations [SGTP08, CHL08, AAS+07], where the simulation of the motion models for the time evolution of a set of interacting particles is accelerated. Another outstanding research field is the processing of medical images, where real-time features are desirable and can be achieved with FPGAs [GMDH08, DCPS07]. FPGA acceleration also reaches fields such as financial simulation [Fel11], geophysical simulation for oil research, aerodynamics research, fluid dynamics, etc. Another interesting research field is the implementation of basic linear algebra subsystems [ZP08]. While they are not applications by themselves, these linear algebra subsystems are computationally intensive routines common to many applications.
1.1.2.1. Monte Carlo Applications
Finally, one subset of applications must be highlighted: Monte Carlo simulations. Monte Carlo acceleration with FPGAs has been an active research field in recent years, as the intrinsic nature of the Monte Carlo approach can be exploited by FPGA properties. On the one hand, Monte Carlo models repeat the same calculations thousands of times, only varying the values of the underlying random variables. Therefore, they fit perfectly with the FPGA's capability of replicating the datapath many times. On the other hand, Monte Carlo datapaths are mainly composed of mathematical operations with a small load of control instructions, and are therefore very well suited for exploiting deeply pipelined, chained datapaths.

This is the case of physical simulations like [PF06], where an FPGA has been used to approach a real-time solution for a radiation transportation problem, [LRL+09], where Monte Carlo is used to calculate light propagation in tissues for medical photodynamic therapy, or radiotherapy treatment planning [FFM+10]. Another remarkable field is financial simulation [ZLHea05, KCL08, MA07, WV08, TTTL10], where the pricing of different financial products is accelerated. Furthermore, circuit design can take advantage of Monte Carlo simulations for timing analysis methods in VLSI design [YTOS11].

In Table 1.1 the speedup factors obtained using FPGAs are summarized for some of these applications, with a maximum speedup of 650. Additionally, when FPGAs are used not only is the simulation time significantly reduced; energy consumption is also decreased. This feature is not usually measured, but some results can be found in the literature where energy consumption is reduced by a factor of 45 [LRL+09].
Table 1.1: Speedups obtained in different case studies

            Field                   SpeedUp (xTimes)
[EH09]      Bioinformatics          45
[PF06]      Physical Monte Carlo    650
[LRL+09]    Physical Monte Carlo    80
[ZLHea05]   Financial Monte Carlo   26
[KCL08]     Financial Monte Carlo   63
[WV08]      Financial Monte Carlo   50
[TTTL10]    Financial Monte Carlo   24
1.1.3. Designing with FPGAs. Challenges
As just seen, the use of FPGAs for hardware acceleration is an active research field. However, there are still several challenges concerning the use of FPGAs as accelerators. We have identified the following key challenges:
• Availability of Cores.
• Capability and performance of FPGAs.
• Methods, algorithms and techniques suited for FPGAs.
• Design tools.
• Hardware-Software co-design and integration.
1.1.3.1. Availability of Cores
Designing complex applications from scratch with FPGAs makes the design cycle extremely long. It is necessary to analyze the application in order to design the architecture required for the datapath, with the target of achieving the highest possible performance. Furthermore, it must be analyzed how to integrate the control requirements into that datapath. Finally, it is necessary to develop all the required basic elements, such as the RNG or the arithmetic operators, which implies studying the type of arithmetic to be used and the resolution and precision of the numbers (data representation).

Thus, the availability of complete and fully characterized basic elements (i.e. operators) targeting FPGAs has become essential to shorten the time needed to design an application while making its design easier. In this way, when designing any application it is essential for designers to have at their disposal libraries of mathematical operators and other components; the analysis and design of mathematical operators is thus one of the foci of this Thesis. Other components, like communication cores, also play a key role in the development of any application, because the hardware accelerator must interact with the host system. Hence, this topic will also be analyzed in this Thesis.
1.1.3.2. Capability and performance of FPGAs
FPGA resources are mainly programmable logic plus interconnections between the logic. Once a design is completed, it is mapped and routed into these logic elements to configure the FPGA with the design functionality. In this way, the results obtained for an application can vary depending on which logic the application requires and how this logic is used (programmed). Additionally, other factors can determine the performance of a design:

• Routing easiness of the design (related to the percentage of the FPGA being used).
• Use of embedded elements.
• Code quality.
• Hard dependencies in the logic.

These issues affect both the capability and performance of FPGAs. With respect to the first issue, since nowadays FPGAs have a huge amount of resources, it usually does not represent a big technical problem, but it can be of great concern with respect to economic issues due to the high cost of the largest FPGAs [CA07].

With respect to the performance that can be achieved, it cannot be exactly predicted before the design is completed, making it difficult to determine whether an application is suited for FPGAs or not. However, if the design only requires characterized cores or well-known structures, the performance can be inferred through the expected clock frequency and throughput.
1.1.3.3. Methods, algorithms and techniques suited for FPGAs
The FPGA's unique combination of features and resources makes it necessary to reevaluate the methods, algorithms and techniques used for computing, in order to decide which ones are the most suited for FPGAs [HVG+07b]. FPGAs offer the designer total flexibility in the selection of signal bit-widths, the arithmetic used, the operations or tasks carried out, the design of the datapaths, etc. Furthermore, the inherent parallel nature of FPGA architectures broadens the design possibilities, allowing chained datapaths with control instructions carried out in parallel.

Adding these features to the set of resources available in current FPGAs, we find that well-established software computing paradigms have to be reevaluated. The techniques and methods considered optimal for software may not be optimal for FPGAs, or may require changes or even improvements to take advantage of FPGA features. In this way, the full exploitation of FPGAs will not only require the adaptation of methods and algorithms but also the development of new techniques specially suited for FPGAs.
1.1.3.4. Design Tools
Hand-written RTL development and debugging are too time-consuming and error prone. The lack of good high-level design tools is one of the major shortcomings a designer has to face when developing a hardware accelerator. Their use can substantially shorten the development cycle, abstracting the designer from low level details, helping in the debugging of the design, and automatically carrying out tedious tasks. Additionally, these tools should provide useful features such as:
• Quick design space exploration.
• Verification aids (test bench generation, software-hardware comparison).
• Support for different arithmetics.
• Automated hardware-software integration.
Currently, the availability of this kind of tools is very limited. The focus is on translating C code to synthesizable RTL code [Tec, Gra]. The possible loss of acceleration associated with the use of these tools is a price worth paying in order to shorten the design time. It is not an objective of this Ph.D. Thesis to develop this kind of tools, but particular interest will be devoted to putting forward the special needs of the development and design of FPGA-based accelerators.
1.1.3.5. Hardware-Software co-design and integration
When developing a hardware accelerator we are dealing with all the major issues related to hardware-software co-design. First, it has to be decided which parts of the code are going to be executed in the accelerator and which ones remain in software. Second, the communication mechanism between software and hardware has to be defined and implemented. Third, software and hardware have to be integrated. These tasks are complex and require expert designers, because they have a strong impact on the performance of the accelerated application.

With respect to the original code, the designer has to evaluate not only which parts of the code are most suited for the accelerator and which ones should remain in software, but also the impact of the data transfers associated with the tasks carried out in the hardware. These data transfers may become a bottleneck in the system, degrading the global performance.

An efficient communication mechanism is required for the synchronization of software and hardware and for the data transfers. From the software point of view, this implies a computational overhead that must be as small as possible. From the hardware point of view, an inefficient communication mechanism could mean that the hardware does not work at its maximum performance, as the accelerator has to wait for data transfers (an illustrative model of this effect is sketched at the end of this subsection).
Finally, the software and the hardware must be integrated, which involves complex and tedious tasks: the modification of the software to invoke the hardware replacing the accelerated code, a driver, a low-level library to control the driver from the software, the hardware itself, etc.
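The combined effect of the partitioning policy and the communication overhead can be reasoned about with an Amdahl-style model. The following C++ sketch is illustrative only: the accelerated fraction, speedup factor, transfer cost and overlap ratio are hypothetical parameters, not profiling data from this Thesis.

```cpp
#include <iostream>

// Amdahl-style model: a fraction 'p' of the original runtime is accelerated
// by factor 's'; data transfers cost 'transfer' (as a fraction of the
// original runtime), of which a share 'overlap' is hidden behind FPGA
// execution. All times are normalized to the original software runtime.
double effective_speedup(double p, double s, double transfer, double overlap) {
    const double software = 1.0 - p;                      // part left in software
    const double hardware = p / s;                        // accelerated kernel
    const double comm     = transfer * (1.0 - overlap);   // exposed transfers
    return 1.0 / (software + hardware + comm);
}

int main() {
    // Hypothetical case: 95% of the runtime accelerated 40x, transfers worth
    // 2% of the original runtime, 75% of them overlapped with computation.
    std::cout << effective_speedup(0.95, 40.0, 0.02, 0.75) << "x\n";  // ~12.7x
    // Fully overlapping the transfers recovers part of the lost speedup.
    std::cout << effective_speedup(0.95, 40.0, 0.02, 1.00) << "x\n";  // ~13.6x
}
```

The model makes explicit why both the residual software fraction and the non-overlapped transfers, however small, bound the achievable global speedup.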
1.2. Objectives and Thesis Structure
Studying in depth each one of these five challenges related to hardware acceleration is not feasible in just one Thesis. The great variety of applications that can be accelerated, and the different features among them, imply that the complexity of each task is high. Therefore, in this Thesis we have chosen one subset of applications to be studied, dealing with the implementation of a real application of this subset.

Selecting a complex subset of applications, in our case Monte Carlo simulations, allows us to make a general analysis of the main topic, hardware acceleration, from the study, analysis and design of a particular application, since this subset shares several features with many other applications. Specifically, we have selected a financial application, the Monte Carlo based LIBOR Market Model.

Financial simulation is a remarkable research field for FPGA acceleration of Monte Carlo simulations, where obtaining quick and accurate results is essential. However, most financial models are computationally intensive and acceleration is necessary to obtain their results with the required speed. Additionally, there is a great variety of financial models, allowing the selection of a complete model where different hardware acceleration issues can be studied.

The focus of this work is on identifying and providing the key elements of a hardware accelerator and incorporating them into the target application, which is characterized by high complexity and hard timing requirements. Furthermore, the integration of the hardware and software parts will also be addressed in depth. In this section we will start by presenting the Monte Carlo basis and the target application, to identify the elements that play a key role in its hardware acceleration. Next, these elements will be used as a basis to formulate the main objectives of this Thesis.
1.2.1. Monte Carlo Simulations and Target Application: LIBOR Market Model
Monte Carlo simulation is often the only tool for dealing with otherwise intractable problems in the areas of scientific calculation or stochastic modeling. A particular case of these intractable problems is financial simulation (the simulation of the pricing of financial derivatives), characterized by extremely complex models and hard execution time constraints. Both requirements have been identified as especially challenging in the case of FPGA-based acceleration, and this is therefore the application case used as benchmark in this Thesis.
1.2.1.1. Monte Carlo Basis
Monte Carlo simulations rely on the use of random numbers to evaluate mathematical expressions. In these simulations, the main state variables of the system under study are sampled using random number generation, and the evaluation of a mathematical expression is repeated many times with different random numbers. Finally, the results are generally obtained as measures of the probability distribution of the system properties. The accuracy of these methods is ensured by the Law of Large Numbers, which guarantees the convergence of the method as the number of simulations grows to infinity. For a finite number of replications an error is introduced, due to the variability in the final result.

Monte Carlo methods were first developed [MU49] by physicists at Los Alamos Laboratory in the context of the Manhattan project. Now, they are widely used to solve three types of problems: mathematical problems whose analytical expressions are very complex (such as certain multidimensional integrals), modeling of physical phenomena where there is uncertainty, and finally, systems with a large number of coupled degrees of freedom. Among the last two types of problems, special attention has been paid to the simulation of physical systems such as molecular dynamics [NBK+09], quantum systems [WAG09] or ray tracing, and to the case on which we have focused, the simulation of financial systems [Gla04], where Monte Carlo is used to value portfolios, interest rate options and other financial products, or insurance simulations [PHT+07].
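As a minimal illustration of the method (a sketch written for this introduction, not code from the Thesis), the following C++ program estimates E[X^2] for X ~ N(0,1), whose exact value is 1. The statistical error shrinks as 1/sqrt(N), which is precisely why Monte Carlo simulations need very large replication counts and become computationally intensive.

```cpp
#include <cmath>
#include <cstddef>
#include <iostream>
#include <random>

// Minimal Monte Carlo sketch: estimate E[X^2] for X ~ N(0,1) (exact value 1).
// The standard error of the estimate decreases as 1/sqrt(N), so each extra
// decimal digit of accuracy costs 100x more replications.
int main() {
    std::mt19937 urng(42);  // uniform generator driving the Gaussian samples
    std::normal_distribution<double> gauss(0.0, 1.0);
    for (std::size_t n : {1000, 100000, 10000000}) {
        double sum = 0.0, sum_sq = 0.0;
        for (std::size_t i = 0; i < n; ++i) {
            const double x = gauss(urng);
            const double f = x * x;  // the quantity being averaged
            sum += f;
            sum_sq += f * f;
        }
        const double mean = sum / n;
        const double err  = std::sqrt((sum_sq / n - mean * mean) / n);
        std::cout << "N=" << n << "  estimate=" << mean
                  << "  std.error=" << err << '\n';
    }
}
```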
1.2.1.2. Specific Design Challenges
Even though Monte Carlo methods are used to solve very different problems, all these simulations share three features that make them complex and computationally very intensive applications:
• Use of random samples.
• Use of complex mathematical expressions.
• A large number of replications.
These features make them ideal candidates for hardware acceleration. Furthermore, the complexity may be increased even further due to:
• Complex distributions for the random samples (Gaussian or Log-normal).
• Variance reduction techniques.
• Complex floating-point operations.
• Datapaths requiring control with added complexity (data dependencies).
1.2.1.3. LIBOR Market Model
The LIBOR Market Model (LMM) [BGM97] is a model for pricing interest rate derivatives. This model involves several variables to be calculated before obtaining its main variables, the LIBORs. The calculation of these variables can be done with different models and complexities, even requiring exponentiation operators and vector products. Therefore, it provides a perfect scenario to explore different solutions for this type of complex operators and structures. In Chapter 5, the LMM will be explained in detail. However, we can advance the main features of the model from the perspective of an FPGA implementation:
• High quality Gaussian random variables are required.
• The use of variance reduction techniques is mandatory to reduce the simulation time. In particular we will study the Latin Hypercube.
• High accuracy requirements: a huge number of replications and floating-point arithmetic.
• Complex floating-point operations, such as the exponentiation function, are required.
• There are computationally intensive routines, such as the correlation of variables.
• Presence of complicated data dependencies.
• Complex control due to the synchronization of different simulation scenarios.
In this way, the LIBOR Market Model is a demanding benchmark, ideal for our purpose of studying the identified design challenges related to FPGA hardware acceleration and, in addition, a good application to investigate whether FPGA acceleration is a feasible solution for financial simulation.
1.2.2. Objectives
As just mentioned, Monte Carlo simulations are especially well suited for being accelerated using reconfigurable hardware. Nevertheless, developing an FPGA Monte Carlo simulation from scratch is almost impossible, or at least very time consuming. However, if the designer can use predesigned libraries of the most common elements, the development time can be substantially reduced, and this is the first focus of this Ph.D. Thesis. Following this scheme, one of the main objectives is to study the common elements that play a key role in Monte Carlo simulations and in our target application. Two common elements stand out: first, the random number generators that are required for the underlying random variables, and second, the floating-point operators that are the base elements for implementing the mathematical models that are evaluated. In this way, the first objective of this Ph.D. Thesis is the study, design and implementation of random number generators. In particular, we focus on Gaussian random number generation and the implementation of a complete generator compatible with variance reduction techniques that can be used for our target application and for other applications. Meanwhile, the second objective deals with the implementation of efficient and FPGA-oriented mathematical operators (complex and using floating-point arithmetic). We focus on the design, development and characterization of libraries of components. Instead of focusing on the algorithms of the operators, our approach will be the study of how the format can be simplified to obtain operators that are better suited for FPGAs and present better performance. One important goal here is to achieve libraries of general purpose components that can be reused in several applications and not just in a particular target application. Related to the target application, a third objective of this work is to deepen the study of the implementation of a particular operator, the exponentiation function. This operator is required in many scientific and financial simulations. Its complexity and the lack of previous general purpose implementations deserve special attention. The next objective is related to the global purpose of the Thesis: validating all the previously developed elements through the implementation of a complex Monte Carlo simulation that involves all the features that can be found in Monte Carlo simulations. In this way, we will deal with the implementation of the target application, the LIBOR Market Model. Special attention will be devoted to all the features, requirements and circumstances that affect the performance of the accelerator. To validate all the research done, the next objective is to obtain the experimental results provided by the developed FPGA accelerator, which are validated against the original software implementation. Three main features will be analyzed:
• Correctness of the results obtained.
• Accuracy.
• Speedup factors obtained by the global application and by each of the main components.
Finally, the last objective is the integration of the hardware accelerator within the original software application. All issues related to the communication mechanism are studied, putting special focus on how performance is affected by data transfers and by the hardware-software partitioning policy implemented.
Figure 1.1: Thesis Structure.
1.2.3. PhD Thesis Structure
As mentioned before, the scheme followed in this Thesis is a bottom-up methodology with respect to the target application, starting with the components and cores needed, following with the implementation of an accelerator and finishing with its integration into the original software application. The structure of this Thesis follows this methodology and we have organized this document in five main chapters, each one corresponding to one of the key research objectives previously identified:

2. Random number generation.
3. Libraries of floating-point operators.
4. Exponentiation function.
5. Development of the target application (LIBOR Market Model).
6. Hardware-software integration.

Figure 1.1 shows how these objectives interact and in which chapters they are discussed. The first three objectives tackle the study, design and implementation of the common elements that are required in Monte Carlo simulations, random number generators and mathematical operators, in Chapters 2, 3 and 4. Three main features are sought: obtaining high performance cores, reusability and specific design to take advantage of the FPGA's set of resources. In particular, Chapter 2 focuses on Gaussian random number generators (they also comprise uniform random generators) as they are widely used in Monte Carlo simulations. Three main topics are studied with respect to FPGA acceleration: uniform random number generators, Gaussian random number generators and variance reduction techniques. Finally, Chapter 2 ends with the implementation of a parameterizable Gaussian random number generator compatible with variance reduction techniques that will be the base for the random generation of the target application. Regarding mathematical operators, Chapter 3 concentrates on floating-point arithmetic, as this is the arithmetic required for many scientific and financial Monte Carlo simulations. The floating-point standard and format are studied and several simplifications are proposed. Additionally, the capabilities of FPGAs to implement data chains with a high number of operators are studied. Meanwhile, in Chapter 4 we analyze in further detail a complex operator, exponentiation, focusing on how we can take advantage of FPGA flexibility to ensure an accurate result. Once we have studied the key elements for any Monte Carlo simulation, random generators and mathematical operators, the next issue we face is the implementation of an accelerator corresponding to a real application and the challenges this implies, Chapter 5. In this chapter, we study how a complex model can be implemented in a hardware accelerator and all the limitations and restrictions that we have to face. Finally, the accelerator has been integrated within the original application in a personal computer system, Chapter 6. In this chapter, the software-hardware partition policy that we have followed for the implementation of the LMM core can be found. Each chapter is almost a complete study of the topic under analysis, comprising theory, study, design, implementation and results. In the same way, each chapter also includes its own review of related works.
2 Random Number Generation
As in other stochastic simulations, a key element for any Monte Carlo simulation is a good random number generator to sample the variables under study. Moreover, good quality random numbers are a must, since the quality of the results obtained is directly related to the quality of the random numbers used. Depending on the nature of the model simulated using the Monte Carlo method, the required random numbers will follow specific probability distribution functions. However, almost all random number generation methods rely on the use of a base uniform random number generator (URNG) whose samples are transformed into the target distribution following some method or equation. Therefore, using a good URNG is a key issue for any random number generator (RNG). When random numbers are related to the Monte Carlo method, besides the quality of the numbers and the specific distribution needed, one more fact has to be taken into account: the compatibility with variance reduction techniques. The huge number of replications of the model needed in Monte Carlo simulations impacts directly on the total simulation time required. These techniques have been developed to reduce the number of replications and hence the total simulation time, so random number generation has to be studied taking these techniques into account.
In this chapter random number generation is studied from the global perspective of Monte Carlo simulation and considering FPGA implementation issues. First, an overall introduction to RNGs is provided, followed by an analysis of URNGs. Afterwards, Gaussian RNGs are studied, as the Gaussian distribution is one of the most common distributions for Monte Carlo simulations and is the one required by our target application, the LMM. Then, variance reduction techniques are introduced. Hardware issues related to all these topics are discussed while some generators for FPGAs are developed. The main objective of this chapter is focused on this point: the study and implementation of FPGA generators oriented to hardware acceleration. In this way, special attention will be devoted to the implementation of a parameterizable Gaussian RNG compatible with variance reduction techniques, designed specifically to be used in accelerators. This element is not only a key component for our selected application and many other Monte Carlo simulations; its quality is also determinant for the accuracy of the simulation results. Hence, fulfilling high quality requirements will be another important issue in this chapter. As is exposed in Section 2.3.2, in the literature we cannot find any FPGA Gaussian RNG that fulfils this criterion in combination with all the other criteria that we have identified for a Gaussian RNG. In this way, a new Gaussian generator is developed focusing on three main components:
• The Gaussian generation method selected, the inversion method, and how we adapt it to FPGAs.
• A high-performance uniform RNG, a Mersenne Twister, to be used as the base generator for the inversion-based Gaussian RNG.
• A parameterizable variance reduction techniques core to be used in combination with the uniform RNG.
2.1. Random Number Generation: Overall Introduction
An ideal RNG fulfils two main characteristics. Firstly, it generates random numbers whose distribution follows exactly the target distribution. And secondly, the random numbers should be uncorrelated, in other words, mutually independent. However, except for some rare generators based on physical events like radioactivity or the electron's spin, and thereby not feasible here, both hardware and software conventional random number generators cannot completely fulfil the second characteristic. Generators rely on equations which use previously generated random numbers or on events that are not completely random. Consequently, it is more adequate to talk about pseudo-random generators, and, taking this into account, the main feature that a good generator should accomplish is that, for the given quantity of random numbers needed, its behavior resembles the behavior of an ideal generator.
Obviously, this is not a measurable characteristic, but an idea of how good a random number generator is can be obtained by looking at:
1. Periodicity: algorithm-based generators have a period. The sequence of generated numbers starts repeating after a period since algorithms are deterministic and have a finite number of states.
2. Randomness: the generated sequence must behave as a truly random sequence. Randomness is not a quantifiable parameter but there are two methods to evaluate it:
• Theoretical properties of the algorithm. • Statistical tests.
Additionally, other important features to consider for a random number generator are:
1. Reproducibility: the capacity of generating the same sequence again. In algorithm-based generators, if the same seed is used, the sequence will always be the same.
2. Speed: number generation throughput. Generation speed is very important, as many applications demand a huge number of random values for simulations.
3. Portability: the same generator should produce the same sequences of numbers on different computing platforms (either hardware-based or software-based platforms).
2.2. Uniform Random Number Generation
Uniform Random Number Generators (URNG), which provide a uniform distribution, in particular over the interval [0,1), are the most common generators, as generators following other distributions almost always need a URNG as a base generator. Computer URNGs are mainly based on algorithms relying on integer operations that generate a uniform distribution of integers in the interval [0,m) and then scale them to [0,1) by dividing by m. More complex generators follow the same scheme except that they combine the results from several basic generators before the scaling, in order to improve the theoretical properties of the basic algorithms. Developing good URNGs is quite easy (both in hardware and software) and it has been studied in depth [L'E97]. Multiple different URNGs can be found in the literature based on different methods: linear congruential [Leh51], Mersenne Twister [MN98] or Tausworthe [Tau65] generators, to name a few. Most of them, although only involving bitwise operations or simple equations, present good quality (high period and good randomness), and the random samples are generated efficiently without increasing the complexity of the Monte Carlo simulation.
In the research field of URNGs specific to FPGAs, several works by D. B. Thomas et al. stand out, such as [TL07, TL10], where several URNGs are proposed, studied and developed exploiting the specific resources and architecture of FPGAs. The development of new URNGs has been widely studied in recent years and is out of the scope of this Thesis. However, a brief explanation of the most common methods is presented next to explain why we have selected the Mersenne Twister generator, and its advantages with respect to other generators.
2.2.1. Linear Congruential Generators (LCG)
LCGs are recurrences of the following form:
x_{i+1} = (a x_i + b) mod m    (2.1)
u_{i+1} = x_{i+1} / m    (2.2)

where a, b and m are positive integer constants. In this method, as in all others that use a recurrence, an initial value known as the seed, x_0 (between 0 and m−1), is necessary to start generating values. As a recurrence-based method, the sequence of random numbers is generated from the seed, in this case following the previous equations. The use of a seed and an equation means that these generators present two common features (a minimal C sketch is given after the list below):
• For the same seed, the generated sequence is always the same, as the algorithm is deterministic.
• As each generated value only depends on the previously generated one, when the seed value reappears the sequence of numbers starts repeating from that point. In this way, these generators are periodic (m is not infinite, and it is the maximum number of different values) and this period is independent of the seed used.
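As an illustration of Eqs. (2.1)-(2.2), the following C sketch implements a mixed LCG with m = 2^32, so the modulus is implicit in the 32-bit arithmetic; the multiplier and increment are the classic "Numerical Recipes" constants, chosen here only as an example and not as a generator used in this Thesis.

#include <stdio.h>
#include <stdint.h>

/* Mixed LCG sketch (Eqs. 2.1-2.2): x_{i+1} = (a*x_i + b) mod m, u = x/m.
 * With m = 2^32 the modulus is implicit in uint32_t arithmetic. */
static uint32_t x = 1u;                  /* seed x_0 */

static double lcg_next(void)
{
    x = 1664525u * x + 1013904223u;      /* (a*x_i + b) mod 2^32 */
    return x / 4294967296.0;             /* scale to [0,1) */
}

int main(void)
{
    for (int i = 0; i < 5; i++)
        printf("%.8f\n", lcg_next());
    return 0;
}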
There are two types of LCG, which differ in the value of the constant b. When b is equal to 0 the generator is pure, while if b is not 0 then it is a mixed LCG. For each type there are some conditions to ensure that the LCG has a full period of m numbers (all numbers between 0 and m − 1 are generated before any value is repeated).
2.2.1.1. Pure LCG
x_{i+1} = (a x_i) mod m,    u_{i+1} = x_{i+1} / m    (2.3)
As b is 0, pure LCGs generate uniform values in the interval (0,1). Zero is not achieved because once it is obtained, all subsequent values will also be zeros. Hence the full period has m − 1 values. Pure LCGs are also known as multiplicative LCGs because the conditions between a and m to ensure the full period are multiplicative conditions:
• m is a prime number.
• a is a primitive root of m:
– a^{m−1} − 1 is a multiple of m.
– a^{j} − 1 is not a multiple of m for j = 1, 2, ..., m−2.
• x_0 ≠ 0 (if the seed is 0, all generated numbers are 0).
2.2.1.2. Mixed LCG (MLCG)
x_{i+1} = (a x_i + b) mod m,    u_{i+1} = x_{i+1} / m    (2.4)
Now b is not zero, so the full period includes zero as a valid value [0,1). The conditions to ensure the full period are:
• b and m are relatively prime (their only common divisor is 1, i.e., gcd(b, m) = 1).
• Every prime number that divides m divides a − 1.
• a − 1 is divisible by 4 if m is divisible by 4.
2.2.1.3. Problems related to LCG
LCGs have two main problems. The first one relates to the randomness properties of the method. Each value of the sequence is obtained directly from the previous one, so the randomness achieved is not very good, as the correlation between numbers can be high. The second problem is computational. To achieve very high periods, very high values of m are necessary, and this can create overflow problems in the multiplication. To solve both problems, it is common to employ more complex, combined generators based on the LCG recurrence. There are two types of combined generators. One type uses several previous values to generate the next one (instead of using only the previous one). The other type generates the random variable by combining several LCGs (simple or combined ones).
2.2.2. Combined Generator Rand2
One of the combined generators with the highest quality and most extensive use is the one known as rand2 (as it is referred to in [PTVF88]). This generator is based on the combination of several multiplicative LCGs [L'E88] according to the following equation:
x_i = (Σ_{j=1}^{l} (−1)^{j−1} s_{j,i}) mod (m_1 − 1)    (2.5)

where l generators are combined, with a periodicity:

ρ ≤ (Π_{j=1}^{l} (m_j − 1)) / 2^{l−1}    (2.6)

In the specific case of rand2, two multiplicative LCGs are combined, while the algorithm is additionally improved with a mixing technique to scramble the sequence of the generator: the Bays-Durham shuffle [BD76]. This technique consists of storing a group of calculated random variables in a table and randomly reading one of them. In this way, in each iteration a random variable is read from a position of the table while the random variable calculated in that iteration is stored in the same position. Thus, the scrambling affects both the sequence of output numbers and the generation of the next value, as it depends on the value read in the previous iteration.
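The shuffle itself is simple enough to sketch in a few lines of C; here base_next() is a hypothetical stand-in for the underlying combined generator, and the table size of 32 is arbitrary.

#include <stdint.h>

#define TABSZ 32

extern uint32_t base_next(void);   /* hypothetical underlying generator */

static uint32_t table[TABSZ];
static uint32_t last;

/* Fill the table with outputs of the base generator. */
void shuffle_init(void)
{
    for (int i = 0; i < TABSZ; i++)
        table[i] = base_next();
    last = base_next();
}

/* Bays-Durham shuffle: the previous output selects a table position;
 * the stored value is returned and replaced by a fresh base sample,
 * scrambling both the output order and the next selection. */
uint32_t shuffle_next(void)
{
    int j = last % TABSZ;
    last = table[j];
    table[j] = base_next();
    return last;
}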
2.2.3. Tausworthe Generators
2.2.3.1. Tausworthe-LFSR (basic)
The basic Tausworthe generator is a linear recurrence which combines several previous values of the recurrence to generate the next value, following the equation:
b_i = (a_1 b_{i−1} + a_2 b_{i−2} + ... + a_k b_{i−k}) mod 2    (2.7)

where the a_i coefficients are equal to 0 or 1 and the b_i values are also binary [Tau65]. As the modulus is a prime number, the linear recurrence is determined by the polynomial:
P(z) = z^k − (a_1 z^{k−1} + ... + a_k)    (2.8)

and, if P(z) is a primitive polynomial, the linear recurrence will have the full period ρ = 2^k − 1. In this way uniform numbers can be obtained with the equation:
u_n = Σ_{i=1}^{L} b_{n+i−1} 2^{−i}    (2.9)

where L is the precision of the number that we want to generate. The problem with this type of implementation is that the uniform variable is generated from only one step of one linear recurrence, and thereby all its bits keep some correlation between them.
2.2.3.2. Combined Tausworthe
To resolve the above mentioned problem, several options can be considered. First, each bit used to obtain u_n can come from a different independent linear recurrence:
u_n = Σ_{i=1}^{l} b_{n,i} 2^{−i}    (2.10)

for l independent linear recurrences.
A second option is to obtain u_n from only one linear recurrence, but taking s steps of that recurrence:
u_n = Σ_{i=1}^{L} b_{ns+i−1} 2^{−i}
In this case, if ρ = 2^k − 1 and s are coprime, then u_n will have a full period equal to ρ. To generate u_n from u_{n−1}, s steps of the linear recurrence are needed. But if some conditions are fulfilled [L'E96]:
• P(z) is a primitive trinomial with the form:
P(z) = z^k − z^q − 1  ⟹  b_i = (b_{i−k+q} + b_{i−k}) mod 2
• 0 < 2q < k.
• 0 < s ≤ k − q < k ≤ L.

these s steps can be obtained very quickly, see Section 2.5.1.1. Finally, several generators following this scheme can additionally be combined into just one generator by XORing the u_n of each generator [L'E96]. If their polynomials are coprime, the period of the combined generator will be the least common multiple of the periods of the individual generators.
2.2.4. Mersenne Twister
The Mersenne Twister [MN98] (MT) URNG presents very high quality while achieving a huge period, 2^{19937} − 1 in its most used configuration. Nowadays, this URNG is widely used for Monte Carlo simulations due to its high quality and high performance. In this generator, groups of w bits are handled as vectors and a linear recurrence is applied to those vectors instead of to single bits:
x_{k+n} = x_{k+m} ⊕ (x_k^u | x_{k+1}^l) A    (2.11)

where the x vectors are formed of w bits. Finally, to improve the statistical properties of the generator, the output random numbers are not the vectors of the linear recurrence themselves. Instead, the numbers generated in the recurrence are modified, tempered, with a bitwise multiplication by a w × w binary matrix. In Section 2.5.1.2, a more in-depth analysis of this generator can be found.
2.3. N(0,1) Gaussian Random Number Generation
Most computationally intensive Monte Carlo simulations require a high quality Gaussian Random Number Generator (GRNG), in particular with Normal distribution N(0,1). Furthermore, the software complexity of this type of generators makes them ideal candidates for hardware acceleration. Therefore, developing a high-quality, high-performance hardware GRNG is essential for any Monte Carlo hardware accelerator.
2.3.1. Generation methods
Implementing an N(0,1) Gaussian random number generator can be done using several methods, such as acceptance-rejection, Wallace [Wal96], Box-Muller [BM58] or inversion [BFS83]. Furthermore, all of them are suitable for FPGA implementation [ZLL+05, LLZ+05, LLVC06, LCVL06] (in [LCVL06] a comparison of results between them can be found). Next, the main features of these methods are briefly introduced.
2.3.1.1. Acceptance-Rejection Methods
These methods are based on the generation of another distribution, similar to the target one, which can be easily generated. The samples from this base distribution are candidates for generating the target distribution and will be accepted or rejected (subsampled) according to a mechanism designed to select candidates of the target distribution.
The main features of these methods are determined by the inherent nature of the method. On the one hand, the complexity of generating some distributions is reduced, as a more easily generated base distribution is used for obtaining the candidates. On the other hand, the samples of the base distribution will be rejected with a probability directly dependent on the difference in density between both distributions, and, therefore, a constant throughput of generated samples cannot be ensured.
2.3.1.2. Wallace Method
The Wallace method is based on the idea of obtaining Gaussian variables in the same way as uniform variables are obtained from previous uniform variables, following some recurrence. In particular, this method transforms a vector of K Gaussian variables into K new ones using an orthogonal transformation with a matrix. One of the main features of the Wallace method is that it requires an initial pool of normalized Gaussian variables whose average square value is one. This feature implies the need for a scaling factor to correct the value of the generated variables. The other main feature is that the correlation between variables has to be carefully handled, as new Gaussian variables are generated from previous Gaussian variables.
2.3.1.3. Box-Muller Method
The Box-Muller method employs a straightforward transformation to convert two uniform variables, u_1 and u_2, into two Gaussian ones, x_1 and x_2, following the equations:
x_1 = √(−2 ln u_1) cos(2πu_2),    x_2 = √(−2 ln u_1) sin(2πu_2)    (2.12)
These equations determine the main feature of this method, the use of complex operators to obtain the Gaussian variables, and its main difference with respect to other methods: in each iteration two Gaussian variables are obtained instead of one.
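For reference, the transformation is only a few operations in software, as the C sketch below shows; u1 and u2 are assumed to lie in (0,1) so that the logarithm is well defined.

#include <math.h>

/* Box-Muller sketch (Eq. 2.12): two uniforms u1, u2 in (0,1) are
 * transformed into two independent N(0,1) samples x1, x2. */
void box_muller(double u1, double u2, double *x1, double *x2)
{
    double r     = sqrt(-2.0 * log(u1));
    double theta = 2.0 * 3.14159265358979323846 * u2;
    *x1 = r * cos(theta);
    *x2 = r * sin(theta);
}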
2.3.1.4. Inversion Method
The inversion method is a general method to generate any probability distribution using the inverse function of its corresponding cumulative distribution function (CDF) and uniform variables. The uniform variables correspond to values of the cumulative probability, so the variables of the desired distribution are obtained by calculating, with the inverse function (ICDF), the values that generate that probability. This conversion will be addressed in more detail in Section 2.3.3. Therefore, this method implies a direct transformation of uniform variables into the desired variables, and its main features cannot be generalized, as they will depend on the target distribution and on how the inverse function is implemented. However, one very important feature stands out: as the basis of this method is a one-to-one transformation using the ICDF, the obtained sequence of variables with the desired distribution will keep the structural properties of the uniform sequence.
2.3.2. Monte Carlo Implications and Hardware Implementation
Due to the stochastic nature of the Monte Carlo simulation and according to the Law of Large Numbers, the convergence of a Monte Carlo simulation is ensured when the number of replications of the system under study grows to infinity. For a finite number of paths an error can be introduced due to the variability of the final result. This variability strongly depends on the variance of the underlying variables. For example, if the arithmetic mean of a function f of the system is to be obtained, its expression, for m simulations, would be:
µ = E[f] ≈ (1/m) Σ_{i=1}^{m} f_i    (2.13)

where f_i is the evaluation of f for each replication. According to the Central Limit Theorem, the standard deviation of the approximated value of µ would be:
σ = Std[µ] = Std[f] / √m    (2.14)
The standard deviation implies a confidence interval for the approximated value obtained: the smaller the interval, the more accurate the value. The confidence interval can then be reduced either by increasing the number of replications m or by reducing the variance of f. Increasing the number of replications m has a very important drawback. The statistical nature of these methods makes this type of simulations computationally very intensive. The calculation time grows linearly with the number of replications, but the standard deviation decreases only as its square root. Therefore, reducing the result deviation by increasing m will have an important impact on the execution time, while the increase has very little impact on the standard error when m is already big, because 1/√m decreases very smoothly (Figure 2.1). The second alternative, reducing the variance of f, relies on a set of methods known as variance reduction techniques [Gen98, Gla04]. These techniques are based on doing a smart sampling of the space of the underlying random variables so that the variance of f is reduced, thus reducing the number of replications required and the execution time needed. Meanwhile, these samplings have a limited impact on the calculation time. In Section 2.4 variance reduction techniques will be shown in depth. As the main problem we are dealing with is calculation time, the compatibility of the Gaussian generation method with these techniques becomes a must.
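To make Eqs. (2.13)-(2.14) concrete, the C sketch below estimates E[f] and the standard error of that estimate from m replications; rand01() and f() are hypothetical stand-ins for the base URNG and the function under simulation.

#include <math.h>

extern double rand01(void);   /* hypothetical base URNG, uniform in [0,1) */
extern double f(double u);    /* hypothetical function evaluated per replication */

/* Monte Carlo estimate of E[f] (Eq. 2.13) and its standard error
 * sigma = Std[f]/sqrt(m) (Eq. 2.14). */
double mc_mean(int m, double *std_err)
{
    double sum = 0.0, sum2 = 0.0;
    for (int i = 0; i < m; i++) {
        double fi = f(rand01());
        sum  += fi;
        sum2 += fi * fi;
    }
    double mu  = sum / m;
    double var = sum2 / m - mu * mu;   /* estimate of Std[f]^2 */
    *std_err = sqrt(var / m);          /* decreases as 1/sqrt(m) */
    return mu;
}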
Figure 2.1: Plot of 1/√m versus the number of simulations.
Also related to the calculation time, two more issues have to be handled. To fully exploit the capacity of FPGAs for large pipelined datapaths, one important feature for a hardware GRNG is that it must generate one random number per clock cycle, so the datapath is not halted waiting for new random numbers. Furthermore, the frequency at which the random numbers are generated should be high, so the performance of the whole datapath is not limited by the GRNG. In a hardware implementation, this is closely related to whether a generation method can be pipelined or not. Finally, one more issue has to be studied: which arithmetic is best suited to generate the random numbers. Gaussian distribution variables are concentrated around the zero value, reaching values as small as 5.8 × 10−8 (for a 32-bit integer uniform input with the pre-scaled value 2^31 + 2^0), while in the tails the Gaussian distribution reaches up to 6.23. Therefore, the arithmetic employed must provide enough resolution to represent those values with high accuracy and precision, considering that the most significant bit can be displaced by more than 20 bits. Hence, we have considered that the generation method should be compatible with floating-point arithmetic, as it adapts better to this range of values than fixed-point arithmetic (which would require a high precision to guarantee high quality samples). In summary, four implications or criteria have to be considered when selecting the most suitable method for random Gaussian generation:
1. Compatibility with variance reduction techniques.
2. Possibility of generating one sample per cycle.
3. High clock rate, or possibility of pipelining to achieve it.
4. Compatibility with floating-point arithmetic.
Due to the combination of these four criteria, we have selected the inversion method as the generation method best suited for Monte Carlo based simulations, as it is the only method which can fulfil the four criteria, see Table 2.1. Additionally, the inversion method has another advantage: it is a general method suitable for other target distributions, so the framework developed for an inversion GRNG can be reused for RNGs of other distributions.
Table 2.1: Gaussian Generation Methods - Selection Criteria.

                     Acc-Rej   Wallace   Box-Muller   Inversion
Variance Reduction   Hard      Hard      Hard         Easy
Sample/Cycle         No        Yes       Yes          Yes
Pipeline             Yes       No        Yes          Yes
Floating-point       Yes       No        Yes          Yes
With respect to the first criterion, the inversion method is the only one where variance reduction techniques can be applied on the base uniform variables, while in the other three methods they would have to be applied over the Gaussian variables, see Sections 2.3.3 and 2.4. Regarding the second criterion, the acceptance-rejection method cannot ensure one sample per cycle, as some samples are rejected. Furthermore, the application of variance reduction techniques over the Gaussian variables instead of the base uniform ones can have an impact on this criterion. The third criterion is fulfilled by all the methods except the Wallace one. Due to the nature of the recursion it uses, if pipelining is introduced, it cannot achieve one sample per cycle. In the other three methods, all the involved mathematical operations can be pipelined, their base generators (uniform for inversion and Box-Muller, or the one selected in acceptance-rejection) being the only part that cannot be pipelined in case they depend on a recursion based on the last generated value. As these base generators are usually very fast, they do not compromise the global clock frequency. Finally, again the Wallace method is the only one that does not fulfil the last criterion. Its recursion is between fixed-point Gaussian samples whose conversion to floating-point samples would imply a degradation of the distribution quality. Although the inversion method can fulfil all the criteria, this is not the case for the inversion-based Gaussian RNG found in the literature [LCVL06]. That particular implementation is a fixed-point implementation whose samples cannot be converted to high quality floating-point samples due to its limited quality and resolution, see more details in Section 2.5.2.3. Therefore, we have developed a new inversion-based generator. Next, the theoretical basis of the approach we have followed is presented.
2.3.3. Inversion Method with Quintic Hermite Interpolation
As stated before, inversion is a general method to generate any probability distribution using the inverse function of the cumulative distribution function (CDF) of the desired distribution:
Figure 2.2: Inversion Method.
F(x) = CDF(X ≤ x) = ∫_{−∞}^{x} f(t) dt    (2.15)

X = F^{−1}(U)    (2.16)

where U is a uniform distribution from which the desired distribution, f(x), is obtained, F(x) being the cumulative distribution of f(x). In Figure 2.2 the method is depicted for N(0,1). The generated uniform variables correspond to values of the cumulative probability (axis y in the GCDF graphic of the figure). Then the distribution variables are obtained by calculating, with the inverse function, the x values that generate those probabilities. Although this method relies on this simple idea, it has a great advantage: this kind of transformation of uniform samples into the target distribution samples keeps the structural properties of the uniform sequence, and thereby variance reduction techniques can be directly applied to the uniform variables. With this method, any distribution can be generated if its F^{−1} is known. However, F^{−1} can be a very complex function, as for the Normal Gaussian distribution, N(0,1), so it has to be approximated and numerical algorithms are needed for its computation.
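As a simple illustration of the method, for a distribution whose ICDF has a closed form the inversion is a one-line computation; the exponential distribution, with F(x) = 1 − e^{−λx} and F^{−1}(u) = −ln(1 − u)/λ, is sketched below in C. For N(0,1) no such closed form exists, which is precisely why the approximations discussed next are needed.

#include <math.h>

/* Inversion sketch for the exponential distribution with rate lambda:
 * F(x) = 1 - exp(-lambda*x), so F^{-1}(u) = -ln(1-u)/lambda. Each
 * uniform u in [0,1) maps directly to one exponential sample. */
double exp_invert(double u, double lambda)
{
    return -log(1.0 - u) / lambda;
}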
Table 2.2: Gaussian ICDF Implementation Requirements.

                          +/−    ×    /    √    exp    ln
erfc^{−1}                  37   34    3    2     1      2
Interpolation degree n      n    n    n    −     −      −
2.3.3.1. Direct Inversion
The inverse function approximation can be done in several ways, and the most direct one is to do a general approximation, where F^{−1} is approximated by just one function valid for any value in the range (0,1). This method is the most intuitive one, and for any uniform value in (0,1), the target distribution variable is obtained just by computing F^{−1}. The problem is how to obtain F^{−1}. In our case, this is the inverse function of the CDF of N(0,1), where the CDF is:
CDF(x) = (1/2)(1 + erf(x/√2)) = (1/2) erfc(−x/√2)    (2.17)

where erf and erfc are the error function and the complementary error function. To obtain the inverse function of the CDF it is necessary to obtain the inverse of the erf or erfc functions. However, both functions are very complex to calculate. In particular, as can be seen in Table 2.2, the inverse erfc function calculated with the method developed in [Mor83] requires a huge number of operations, among which we can even find exponential and logarithm functions.
2.3.3.2. Interpolated-Segments Inversion
Another way to compute the inverse function is to approximate it by segments or intervals. In this case, the range of the inverse function is divided into several segments, and then the inverse function is approximated in each segment by interpolation, following a polynomial equation. Unlike the previous method, interpolated inversion is not a direct method. Before the inversion is calculated for any uniform value, the segment that corresponds to the uniform input value has to be determined. Then, the inversion is made by applying the interpolation of that segment to the uniform value. The most common technique to carry out interpolated-segments inversion is to use the same type of interpolation for all the segments. Thereby, the same general inversion function is used for all the segments, with the polynomial coefficients differing for each segment but not the operations. A very important advantage of this method is that the coefficients, for any type of interpolation, do not change while the starting and ending points of each segment do not change. This means that they do not have to be calculated at execution time, so we can calculate them in advance and store them in memory tables. At execution time, for each uniform value, the memory tables are accessed to read the coefficients corresponding to its calculation segment, which are then applied to the interpolated inverse function. The overall characteristics of the inversion by interpolation will depend on the interpolation method, on the selected degree and on the segmentation policy. These features will determine the quality of the inversion (how accurate the approximation is) and how many segments are needed for a given accuracy. There exist several methods for spline interpolation, such as:
• Chebyshev
• Legendre
• Jacobi
• Hermite
Among them, Hermite interpolation stands out due to three facts [HL03]. Firstly, the better results obtained for the same interpolation degree with respect to the other methods, as it takes into account the density of the function. Secondly, Hermite interpolation is a local approximation, and its accuracy can be improved just by introducing more segmentation points where needed, without recalculating all the segments. And finally, F^{−1} is a monotonically increasing function and the monotonicity properties of Hermite interpolation ensure the interpolation will also be monotonically increasing. One last advantage is that the calculation of the polynomial constants is easier than with the other methods.
2.3.3.3. Quintic Hermite Interpolation Equations
In [HL03] several polynomial degrees of Hermite interpolation (linear, cubic and quintic) are studied. The quintic interpolation obtains the best results in terms of accuracy (almost exact) and smaller number of segments. Therefore, we have selected quintic interpolation to develop a GRNG.
For each segment [p_i, p_{i+1}] of the GCDF, where u_i = CDF(p_i), the inverse function is interpolated with a degree-5 polynomial:
H_i^5(ū) = a_{i0} + a_{i1}ū + a_{i2}ū² + a_{i3}ū³ + a_{i4}ū⁴ + a_{i5}ū⁵    (2.18)

where ū = u − u_i, with u being the uniform variable to invert, u_i the starting point of the segment and u_{i+1} the starting point of the next segment. For each segment the values of the coefficients are [DEH89]:
a_{i0} = p_i    (2.19)
a_{i1} = 1/f_i    (2.20)
a_{i2} = −f_i′ / (2 f_i³)    (2.21)
a_{i3} = (3 f_i′/f_i³ − f_{i+1}′/f_{i+1}³) / (2Δu) + (10Δs − 6/f_i − 4/f_{i+1}) / Δu²    (2.22)
a_{i4} = (−3 f_i′/f_i³ + 2 f_{i+1}′/f_{i+1}³) / (2Δu²) + (−15Δs + 8/f_i + 7/f_{i+1}) / Δu³    (2.23)
a_{i5} = (f_i′/f_i³ − f_{i+1}′/f_{i+1}³) / (2Δu³) + (6Δs − 3/f_i − 3/f_{i+1}) / Δu⁴    (2.24)
where f_i = f(p_i), f_i′ = f′(p_i), Δu = u_{i+1} − u_i, Δp = p_{i+1} − p_i, Δs = Δp/Δu and (F^{−1}(u_i))′′ = −f′(p_i)/f(p_i)³.
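Once the coefficients of a segment are available, evaluating Eq. (2.18) reduces to a Horner scheme of five multiplications and five additions, as in the C sketch below; in the hardware generator this corresponds to a fixed chain of multipliers and adders.

/* Evaluate the quintic Hermite interpolant of Eq. (2.18) at ubar = u - u_i.
 * a[0..5] holds the precomputed segment coefficients a_i0..a_i5
 * (Eqs. 2.19-2.24). */
double hermite5_eval(const double a[6], double ubar)
{
    return a[0] + ubar * (a[1] + ubar * (a[2] + ubar * (a[3]
                + ubar * (a[4] + ubar * a[5]))));
}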
2.4. Variance Reduction Techniques
As introduced in Section 2.3.2, variance reduction techniques are essential to reduce the total simulation time required by a Monte Carlo simulation, as they reduce the total number of replications required to obtain the result with a given confidence interval. To reduce the variance between the results of the different replications, these techniques modify the sequence of random numbers provided by the random number generator to cover the space of the underlying variables in a better way than with a pure random sequence. To do so, variance reduction techniques introduce some dependencies in the random numbers. These dependencies must be introduced across replications, since if they were introduced within replications the result obtained would be meaningless. Among these techniques we can highlight:
• Control Variates.
• Antithetic Variables.
• Stratified Sampling.
• Latin HyperCube.
• Importance Sampling.
An extensive description of these techniques can be found in [Gen98, Gla04]. However, in this Thesis we are going to focus on two of them, Stratified Sampling and Latin Hypercube, as they are the most general techniques and can be applied in a similar way to all Monte Carlo simulations.
2.4.1. Stratified Sampling and Latin Hypercube
The stratified sampling technique is based on the idea of better covering the space of a random variate by dividing it into n subsets or strata A_i. Let us assume that f depends on the random variate r. The expected value of f is then calculated according to:
E[f] = Σ_{i=0}^{n−1} E[f(r)|r ∈ A_i] · P(r ∈ A_i)    (2.25)

where E[f(r)|r ∈ A_i] is calculated by Monte Carlo simulation, P() being the probability distribution. The number and definition of the strata are chosen to minimize the variance of the average estimator. If we choose equally likely strata, the expected value is now given by:
E[f] = (1/n) Σ_{i=0}^{n−1} E[f(r)|r ∈ A_i]    (2.26)
In this case, if the needed random samples follow a uniform distribution [0,1), their generation to calculate E[f(r)|r ∈ Ai] is easily carried out by means of the following equation:
u_i^s = u_i/n + i/n;    with i = 0, ..., n − 1    (2.27)
where each i/n is the starting point of a stratum and the uniform variables u_i are scaled to the stratum size.
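Eq. (2.27) translates directly into code: one scaled and offset uniform draw per stratum, as in the C sketch below, where rand01() is a hypothetical stand-in for any base URNG.

extern double rand01(void);   /* hypothetical base URNG, uniform in [0,1) */

/* Stratified sampling of a [0,1) uniform variate (Eq. 2.27): one draw
 * per stratum, scaled to the stratum width 1/n and offset by i/n. */
void stratified(double us[], int n)
{
    for (int i = 0; i < n; i++)
        us[i] = (rand01() + i) / n;
}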
When r does not follow a uniform distribution, the calculation of E[f(r)|r ∈ A_i] by means of the Monte Carlo method is more involved, due to the complexity of sampling r according to its distribution function P[r ∈ A_i]. Stratified sampling can also be extended to multidimensional variates. Each dimension is partitioned into n strata, giving a total number of n^d strata, d being the number of dimensions. This makes the method unfeasible when the number of dimensions is high, as the required number of multidimensional variates follows this exponential growth. The Latin Hypercube [Gen98] can be understood as an alternative extension of stratified sampling that does not suffer from the problem of dimensionality. Latin Hypercube consists of stratifying the one-dimensional marginals of a joint distribution instead of stratifying the whole multidimensional space. Therefore, the number of multidimensional variates grows linearly with the number of dimensions. For the sake of illustration we will outline the sampling procedure for a multidimensional independent uniform variate. Let Π_0, Π_1, ..., Π_{d−1} be d permutations of {0, 1, ..., n−1} drawn independently, assuming that the n! different permutations are equally likely. Random samples of the multidimensional uniform variate can be obtained according to:
Figure 2.3: Dimension Impact: (a) Stratified Sampling; (b) Latin Hypercube.
v_i^j = (u_i^j + Π_j(i)) / n;    j = 0, ..., d − 1,  i = 0, ..., n − 1    (2.28)
where the u_i^j are independent draws of a one-dimensional uniform random variate. In Figure 2.3 the difference between both techniques is graphically depicted for two dimensions and four strata. As can be seen, with Stratified Sampling, d = 2 and n = 4, it is necessary to work in groups of 16 replications, as 16 random variables are required. Meanwhile, with Latin Hypercube only groups of four replications are needed. This feature is of great importance in Monte Carlo simulations where the model to simulate is a temporal model and the simulation of each time step requires different random variables. In these cases we need to work with groups of x replications for the total number of time steps, t, requiring x × t random numbers to which variance reduction techniques have been applied. This imposes high memory requirements and will determine how to simulate, see Section 5.3.1.
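A C sketch of the Latin Hypercube sampling of Eq. (2.28) follows; each dimension receives its own Fisher-Yates permutation of the strata, and rand01() again stands in for a base URNG.

#include <stdlib.h>

extern double rand01(void);   /* hypothetical base URNG, uniform in [0,1) */

/* Latin Hypercube sampling (Eq. 2.28) for d dimensions and n strata:
 * each dimension j gets an independent random permutation pi_j of
 * {0,...,n-1}, so every one-dimensional marginal is stratified while
 * only n multidimensional samples are drawn. v is an n-by-d array in
 * row-major order (row i holds sample i). */
void latin_hypercube(double *v, int n, int d)
{
    int *perm = malloc(n * sizeof *perm);
    for (int j = 0; j < d; j++) {
        for (int i = 0; i < n; i++)        /* start from the identity */
            perm[i] = i;
        for (int i = n - 1; i > 0; i--) {  /* Fisher-Yates shuffle */
            int k = (int)(rand01() * (i + 1));
            int tmp = perm[i]; perm[i] = perm[k]; perm[k] = tmp;
        }
        for (int i = 0; i < n; i++)
            v[i * d + j] = (rand01() + perm[i]) / n;   /* Eq. 2.28 */
    }
    free(perm);
}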
2.5. Developed Gaussian Random Number Generator
2.5.1. Uniform Random Number Generator
Inversion-based RNGs inherit the statistical properties of their base uniform random generator. Thereby, one of the characteristics that a URNG used in a Monte Carlo simulation must fulfil is the high quality of the generated random sequences: good randomness, high period and good statistical properties. Among the generators mentioned in Section 2.2, Rand2, MT and the combination of several basic generators fulfil these characteristics. As exposed in Section 2.3.2, one of the desired features we are looking for in the hardware implementation is the generation of one random value per clock cycle, at the highest possible frequency. Rand2 and the combined generators require the currently generated value to compute the next one and, therefore, if one value is generated per cycle, no pipeline can be introduced. This circumstance is not a problem for basic combined generators, as they are composed of very fast bitwise operations, but it discards the use of Rand2, as all its operations would have to be performed in just one cycle and they are complex and slow (two multiplications, two modulus operations, two additions, one division and one subtraction). Therefore, we have selected two URNGs to develop and use as base uniform random generators: a combined Tausworthe generator and the MT.
2.5.1.1. FPGA Tausworthe 88
The Tausworthe 88 [L'E96] is a well-known URNG which combines three of the basic Tausworthe generators presented in Section 2.2.3.2. The sequence of uniform variables is obtained by combining, using an XOR operation, the three variables obtained from the basic generators. The three basic generators combined in Tausworthe 88 (or Taus88) are three trinomials for sequences of 31, 29 and 28 bits respectively, with full period and the following sets of parameters (k, q, s), see Section 2.2.3.2:
• P1(31, 13, 12)
• P2(29, 2, 4)
• P3(28, 3, 17)
Therefore, as the trinomials used are pairwise relatively prime, the period of the combined generator is 2^88. In software, the recurrence of each of the generators can be quickly calculated; given the vector A with u_{n−1}, the vector B for temporary storage and the vector C containing a mask composed of k ones followed by L − k zeros, u_n can be calculated in six operations:
1. B ← A left shifted q bits.
2. B ← A XOR B
3. B ← B right shifted k − s bits
4. A ← A&C
5. A ← A left shifted s bits
6. A ← A XOR B
This is even further simplified in hardware, due to the possibility of working at the bit level, so these six operations are reduced to just an XOR and the concatenation of bits:
U_{n+1}^i(31 : 32−(k−s)) = U_n^i(31−s : 32−k)    (2.29)
U_{n+1}^i(31−(k−s) : 0) = U_n^i(31 : k−s−1) ⊕ U_n^i(31−q : k−s−q−1)    (2.30)

while the complete generator implies two more XOR operations:
U_{n+1} = U_{n+1}^1 ⊕ U_{n+1}^2 ⊕ U_{n+1}^3    (2.31)
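In software form, the complete generator is L'Ecuyer's well-known C routine from [L'E96], reproduced below as a sketch; each three-line group implements the six-operation recurrence of one component with its (k, q, s) parameters, and the seeds must satisfy s1 ≥ 2, s2 ≥ 8, s3 ≥ 16 so that the bits below each mask are irrelevant.

#include <stdint.h>

/* Taus88 sketch following [L'E96]: three component generators with
 * (k,q,s) = (31,13,12), (29,2,4) and (28,3,17), combined by XOR. */
static uint32_t s1 = 2u, s2 = 8u, s3 = 16u;   /* example seeds */

double taus88(void)
{
    uint32_t b;
    b  = ((s1 << 13) ^ s1) >> 19;            /* steps 1-3: q = 13, k-s = 19 */
    s1 = ((s1 & 0xFFFFFFFEu) << 12) ^ b;     /* steps 4-6: 31-bit mask, s = 12 */
    b  = ((s2 << 2) ^ s2) >> 25;
    s2 = ((s2 & 0xFFFFFFF8u) << 4) ^ b;
    b  = ((s3 << 3) ^ s3) >> 11;
    s3 = ((s3 & 0xFFFFFFF0u) << 17) ^ b;
    return (s1 ^ s2 ^ s3) * 2.3283064365386963e-10;   /* scale by 2^-32 */
}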
2.5.1.2. FPGA Mersenne Twister
In the literature, some previous works focused on the hardware implementation of the MT URNG can be found in [SK06, CA08, TB09]. The MT generator is highly parameterizable and these works focus on the most used configuration, MT19937, due to its high quality and the simplifications that its set of parameters introduces. Hence, we also considered that set of parameters as the ideal one for FPGAs. However, none of the previous implementations fulfils all the characteristics we are looking for. [SK06, CA08] present a low frequency and [SK06] does not provide one sample per cycle. Meanwhile, in [TB09] the first part of the algorithm is not implemented in hardware, and therefore it is not a complete implementation; it requires a HW-SW overhead that rules out this implementation. In contrast, we are looking for a hardware MT with the following features:
1. All in hardware.
2. Capable of generating one sample per cycle.
3. Efficient in area and performance.
In the following, the general MT algorithm is described while the simplifications due to the parameter set of MT19937 are introduced. The algorithm is split into three different tasks:
1. Initialization: the generation from a seed of the first n vectors of the recurrence.
2. Obtaining the linear recurrence.
3. The tempering of the variables generated by the linear recurrence.
Firstly, an initialization from the seed is needed, as the linear recurrence requires a work area of n variables. This initialization takes the seed as the first element of the recurrence, x_0, while the other n − 1 variables are generated following the recurrence:
x_i = 1812433253 × (x_{i−1} ⊕ (x_{i−1} >> 30)) + i    (2.32)
Figure 2.4: Mersenne Twister General FPGA Architecture.
Once the first work area, i.e. the initial n variables, is obtained, the random numbers are calculated following Equation 2.11:
x_{k+n} = x_{k+m} ⊕ (x_k^u | x_{k+1}^l) A    (2.33)
where x_k^u denotes the w − r most significant bits of x_k, and x_{k+1}^l corresponds to the r least significant bits of x_{k+1}. To make the computation of the multiplication fast, the matrix A is selected in such a way that (x_k^u | x_{k+1}^l)A is reduced to:
(x_k^u | x_{k+1}^l) >> 1    when x_{k+1}(0) = 0    (2.34)
((x_k^u | x_{k+1}^l) >> 1) ⊕ a    when x_{k+1}(0) = 1    (2.35)

where a is the w-th row of matrix A. Therefore, the equation is mostly reduced to XOR operations. Finally, each random number obtained in the linear recurrence is tempered, i.e., modified to improve its statistical properties, by multiplying the random sample by a matrix T, z = x_{k+n}T. Again, matrix T is selected in such a way that this multiplication is simplified to several logical bitwise operations:
y = x_{k+n} ⊕ (x_{k+n} >> u)    (2.36)
y_1 = y ⊕ ((y << s) & b)    (2.37)
y_2 = y_1 ⊕ ((y_1 << t) & c)    (2.38)
z = y_2 ⊕ (y_2 >> l)    (2.39)
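Putting Eqs. (2.32)-(2.39) together, the following C sketch shows the work-area initialization and one recurrence-plus-tempering step of MT19937, following the reference algorithm of [MN98]; this is the behavior that the hardware architecture of Figure 2.4 reproduces at one sample per cycle.

#include <stdint.h>

#define N 624                        /* work area depth n */
#define M 397                        /* recurrence offset m */
#define UPPER_MASK 0x80000000u       /* w - r = 1 most significant bit */
#define LOWER_MASK 0x7FFFFFFFu       /* r = 31 least significant bits */

static uint32_t mt[N];               /* work area */
static int idx;

/* Initialization task (Eq. 2.32): expand the seed into the n vectors. */
void mt_init(uint32_t seed)
{
    mt[0] = seed;
    for (int i = 1; i < N; i++)
        mt[i] = 1812433253u * (mt[i - 1] ^ (mt[i - 1] >> 30)) + (uint32_t)i;
    idx = 0;
}

/* One step of the linear recurrence (Eqs. 2.33-2.35) plus tempering
 * (Eqs. 2.36-2.39), producing one sample per call. */
uint32_t mt_next(void)
{
    uint32_t x = (mt[idx] & UPPER_MASK) | (mt[(idx + 1) % N] & LOWER_MASK);
    uint32_t xA = x >> 1;
    if (x & 1u)                      /* x_{k+1}(0) = 1: XOR with vector a */
        xA ^= 0x9908B0DFu;
    uint32_t y = mt[(idx + M) % N] ^ xA;
    mt[idx] = y;                     /* x_k is replaced by x_{k+n} */
    idx = (idx + 1) % N;

    y ^= y >> 11;                    /* Eq. 2.36, u = 11 */
    y ^= (y << 7)  & 0x9D2C5680u;    /* Eq. 2.37, s = 7,  b */
    y ^= (y << 15) & 0xEFC60000u;    /* Eq. 2.38, t = 15, c */
    y ^= y >> 18;                    /* Eq. 2.39, l = 18 */
    return y;
}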
As just seen, the MT URNG depends on multiple parameters (w, n, m, r, a, u, s, b, t, c, l), corresponding in MT19937 to the set (32, 624, 397, 31, 9908B0DF, 11, 7, 9D2C5680, 15, EFC60000, 18).

Figure 2.5: MT work area storage: (a) MT three-port table; (b) MT circular buffer.

The hardware implementation of the MT generator has to deal with the three tasks (initialization, recurrence, tempering), while a storage element is needed for the samples composing the work area, as can be seen in Figure 2.4. From the previous equations it can easily be deduced that the linear recurrence and the tempering tasks fulfil the criteria of obtaining one sample per cycle while achieving a high clock rate, as they are mostly composed of XOR bitwise operations. Furthermore, due to the depth of the work area and the dependencies among samples, the logic of both tasks can be pipelined to increase the clock rate. Meanwhile, the initialization task also requires a multiplication and an addition. Although this task will not be working once the first n-sample work area is generated, in reconfigurable hardware the slowest task determines the maximum clock rate. Hence, the clock rate of the MT generator is also determined by the more complex and slower initialization logic and, therefore, pipelining this task is a must if a high clock rate is desired. Another important fact to take into account is the storage element. An n-word storage area is needed for the linear recurrence equation, and due to the nature of the linear recurrence, two storage options are suitable, see Figure 2.5: a storage table and a register buffer. In the first case, a three-port table is needed. It can be implemented with two dual-port block RAMs or distributed logic, plus the logic required for the indexes. The three ports required are one for reading x_{k+1}, another for reading x_{k+m} and another for writing x_{k+n}, while x_k is obtained from x_{k+1}, see Figure 2.5(a). In the second case, a buffer of registers can be used, as the relationship between the indexes of the words is fixed and in each step of the recurrence x_k is replaced by x_{k+n} in the work area. In this way, the linear recurrence and the buffer of registers can be considered as a circular buffer where the linear recurrence is some combinational logic between the input and the output of the buffer, see Figure 2.5(b).
Table 2.3: Virtex-4 XC4VFX140-11. Table of resources.

Slices   DSP   18 Kb BRAMs
63168    192   552
Table 2.4: URNG Implementation Results.
                              Mersenne Twister 19937
              Taus88   Logic   Init. 1 Cycle    Init. 2 Cycle    Init. 3 Cycle    Init. 4 Cycle    [TB09]
                               CB      Table    CB      Table    CB      Table    CB      Table    V4-FX100
Slices          77      73     807     161      819     158      816     153      816     183      128
BRAM            -       -      -       4        -       4        -       4        -       4        4
DSP             -       -      3       3        3       3        3       3        3       3        -
Start Cycles    1       -      625     625      1249    1249     1873    1873     2497    2497     624
MHz            943.4   646.8   108.2   105.1    214.5   185.3    256.3   314.2    339.3   345.9    265
2.5.1.3. Implementation Results
The Xilinx Virtex-4 XC4VFX140-11 will be the reference FPGA for the implementation results of the subcomponents and cores. Meanwhile, for the whole accelerator implementation the reference FPGA will be the Xilinx Virtex-5 XC5VFX200T-2, Chapters 5 and 6. In Table 2.3 a summary of the resources of the Virtex-4 is available. In Table 2.4, the results for the developed URNGs are summarized. The abbreviations CB and Table refer to the circular buffer and the three-port table MT implementations respectively. Additionally, the results for the most representative FPGA MT in the literature, [TB09], are presented (results from [SK06, CA08] are not included since they are incomplete and their clock frequency is below 40 MHz). The Logic column of the MT generator comprises the results for just the combinational logic of the linear recurrence and the output tempering (introducing a register between them), without taking into account the logic needed for the initialization or the storage element. Hence, this frequency result represents the maximum clock frequency achievable for an MT with a throughput of one sample per cycle. In the complete generator, this performance is reduced due to two facts: the above mentioned delay of the initialization logic and the need for storage elements. As can be seen in the table, the clock frequency improves as more pipeline stages are introduced in the initialization, but only up to a certain limit given by the storage elements. For both implementations with four stages, the slowest path is determined by the storage elements. This circumstance especially affects the circular buffer implementation, as the complex routing of the shift register limits the frequency at which the registers are shifted in the buffer. Meanwhile, for the table implementation, the frequency limit is given by the update of the addresses of the table. Tausworthe 88 presents a really high clock frequency for an FPGA, close to 1 GHz, with an almost negligible use of resources.
2.5.2. N(0,1) Gaussian Random Number Generator
Following the discussion in Section 2.3, the inversion-based N(0,1) GRNG with quintic Hermite interpolation for the approximation of the ICDF has been considered the best suited GRNG for a Monte Carlo simulation. Clearly, two different tasks are required for any inversion-based RNG relying on this technique: a first setup stage, where the calculation of the function approximation is carried out, and a second task where the random variables are generated. Thereby, in the first task, the segmentation of the function range and the calculation of the coefficients for each segment are carried out. In the second task, the random variables are generated from uniform random variables:

1. Generation of a uniform random variable u.
2. Search for the segment corresponding to u.
3. Extraction of the coefficients for that segment.
4. Calculation of GCDF^{−1}(u) applying H_i^5(ū).

This methodology fits very well with a hardware architecture. While no changes are made in the segmentation policy or in the type of interpolation used, the setup stage always produces the same results (the polynomial coefficients and the starting points, u_i, of each segment). These are the data required by the generator and, as they remain unchanged, they can be stored in tables as ROM data. Hence, the setup stage only needs to be computed once. Furthermore, there is no need to do it in the hardware RNG. To reduce the complexity of the RNG, the interpolation can be carried out on a software platform which is also in charge of generating the necessary text files in some hardware description language containing the ROM tables with the polynomial coefficients and the segments' starting points. A software implementation of a GRNG with Hermite interpolation has been previously studied by Hörmann et al. in [HL03] and can be found in the software tool UNU.RAN [LH02]. We have used some of the framework provided by this tool for developing our hardware GRNG.
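In software terms, the per-sample task reduces to a segment search followed by the polynomial evaluation, as in the C sketch below; seg_start[] and seg_coef[][] are hypothetical tables produced by the setup stage, hermite5_eval is the Horner evaluation sketched in Section 2.3.3.3, and in the hardware generator the binary search is replaced by a direct address computation into the ROM tables.

#define NSEG 1024                        /* illustrative segment count */

extern double seg_start[NSEG];           /* hypothetical: starting points u_i */
extern double seg_coef[NSEG][6];         /* hypothetical: quintic coefficients */
extern double hermite5_eval(const double a[6], double ubar);

/* Per-sample generation: locate the segment containing u, fetch its
 * coefficients and evaluate GCDF^{-1}(u) with the quintic interpolant. */
double icdf_sample(double u)
{
    int lo = 0, hi = NSEG - 1;           /* binary search over seg_start */
    while (lo < hi) {
        int mid = (lo + hi + 1) / 2;
        if (seg_start[mid] <= u) lo = mid;
        else                     hi = mid - 1;
    }
    return hermite5_eval(seg_coef[lo], u - seg_start[lo]);
}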
2.5.2.1. Base Software Tool: UNU.RAN
The UNU.RAN tool was developed by Hörmann and Leydold to provide a software tool capable of automatically building software random number generators for several distribution functions, using the inversion method with Hermite interpolation as the generation method. Additionally, in [HL03] the theoretical framework of the tool is explained, and results are provided for several distributions (Gaussian, exponential, Cauchy, etc.) with different interpolation degrees (linear, cubic and quintic) and approximation error bounds. These results demonstrate that quintic interpolation provides the best results, both in the error bounds that can be achieved and in the number of segments required to obtain those bounds.
Figure 2.6: UNU.RAN segmentation.
2.5.2.1.1. Segmentation
In the UNU.RAN implementation, the segmentation algorithm depends on the accuracy. The desired accuracy is a configurable parameter that determines when a segment is considered definitive. In UNU.RAN, the accuracy is related to the x axis of the inverse function (the y axis of the CDF) and is specified as the maximum error allowed along that axis:
$$\hat{\epsilon}_u = \left| CDF\left(H_i^n(u)\right) - u \right|$$

where u is the uniform value and n is the interpolation degree. The input uniform value is compared with the CDF of the interpolated inverse at u. As the CDF is an exact function, this accuracy policy also ensures the accuracy of the inversion.

Segmentation takes into account the whole range of the inverse function, [0,1], and of the CDF, $(-\infty, \infty)$. In addition to the accuracy, two more parameters are required for the segmentation: the maximum size of each interval, measured as $u_i - u_{i-1}$, and the probability to be chopped off at the tails of the CDF. Several distributions, such as the Gaussian N(0,1), extend to $\pm\infty$. However, no interpolation can be done when reaching $\pm\infty$, while the probability mass of these tails is insignificant. Therefore, interpolation can be restricted to a much smaller range (the range that concentrates almost all the probability). In this way, the starting and ending points are selected by configuring the probability we want to chop off.

Once the CDF range is delimited, the segmentation is done according to the selected accuracy and the maximum segment size by an iterative method, see Figure 2.6. In each pass, the segmentation algorithm considers the segment corresponding to all the CDF range that has not been segmented yet (initially, all the CDF minus the chopped tails). If this segment is bigger than the maximum segment size, it is halved as many times as necessary until it is smaller than the maximum size. Then, the obtained segment is interpolated and the coefficients of the interpolation are calculated. To determine whether the segment fulfills the selected accuracy, the function is evaluated at the middle point of the segment (on the uniform axis) using the coefficients calculated for the segment. The middle point is selected because the interpolation is exact at the endpoints, so the largest error is expected in the middle of the segment. While the error measured at the middle point is bigger than the maximum error allowed, the right endpoint of the candidate segment is moved so as to halve it. In this way the candidate segment gets smaller and smaller, and its interpolation more accurate, until the error at its middle point is below the maximum error allowed. At this point the segment is considered valid, and the segmentation continues with the rest of the range.
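The iterative procedure can be summarized by the following C sketch. The helpers fit_quintic, midpoint_error and store_segment are hypothetical placeholders for the interpolation, the midpoint error evaluation and the table construction described above.

/* Hypothetical helpers: interpolate a segment, measure the error at its
   middle point, and store an accepted segment with its coefficients.   */
void   fit_quintic(double l, double r, double c[6]);
double midpoint_error(double l, double r, const double c[6]);
void   store_segment(double l, const double c[6]);

/* Iterative segmentation of [u_start, u_end] (tails already chopped). */
void segment_range(double u_start, double u_end,
                   double max_size, double max_err)
{
    double c[6];                         /* quintic coefficients a0..a5   */
    double left = u_start;
    while (left < u_end) {
        double right = u_end;            /* rest of the unsegmented range */
        while (right - left > max_size)  /* enforce maximum segment size  */
            right = left + (right - left) / 2.0;
        fit_quintic(left, right, c);
        while (midpoint_error(left, right, c) > max_err) {
            right = left + (right - left) / 2.0;   /* halve and refit     */
            fit_quintic(left, right, c);
        }
        store_segment(left, c);          /* segment accepted as definitive */
        left = right;                    /* continue with remaining range  */
    }
}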
2.5.2.1.2. Search Algorithm and Generation of Gaussian Variables
The first task when transforming a uniform variable u into a Gaussian one is finding the segment to which u belongs. The Gaussian inverse CDF is monotonically increasing, and so is the resulting segmentation, so ordered search methods can be used. In this case, an indexed search is employed, with an index table containing pointers to the table that stores the starting points of the segments (the search table). The value u is used to form an address into the index table, which returns a pointer into the search table. This pointer points either to the correct segment or to some close segment below it with $u_i < u$. If the pointed segment is not the correct one, the pointer is incremented by one as many times as needed until the correct segment is reached. Finally, once the corresponding segment is found, the coefficients and the starting point of the segment are extracted and the Gaussian variable is obtained following Equation 2.18.
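The following C fragment sketches this software indexed search. The table names and the guide-table size are illustrative, and a sentinel of 1.0 is assumed at the end of the starting-point table.

#define GUIDE_SIZE 1024                 /* illustrative index-table size */

extern double seg_start[];   /* segment starting points u_i, ending with
                                a sentinel of 1.0                        */
extern int    guide_table[GUIDE_SIZE];  /* conservative start pointers   */

/* Software indexed search: the guide table returns a pointer to the
   correct segment or to one below it; the pointer is then incremented
   a variable number of times, which makes this search multicycle.      */
int indexed_search(double u)            /* u in (0,1) */
{
    int p = guide_table[(int)(u * GUIDE_SIZE)];
    while (seg_start[p + 1] <= u)
        p++;
    return p;
}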
2.5.2.2. Hardware GRNG
The software UNU.RAN implementation is challenging to adapt to a hardware implementation with the features we have defined for a Monte Carlo GRNG. The UNU.RAN search algorithm is multicycle, with a variable number of cycles required for each search. This is in clear conflict with the goal of one random sample per cycle at a high clock frequency, as the variable number of cycles implies that the throughput will no longer be one sample per cycle. Therefore, developing a hardware-oriented search algorithm that can provide one search per cycle is a must for the hardware implementation, see Section 2.5.2.2.3. Additionally, multiple modifications can be introduced to the UNU.RAN method to adapt it to hardware:
1. Segmentation and architecture oriented to hardware URNGs, which typically provide uniform samples as 32-bit integers.

2. Accuracy and architecture oriented to the desired output arithmetic.

3. Simplification of equations to reduce the number of operators and the resources used.

4. Tailored internal arithmetic.
In the following sections, our floating-point GRNG is presented, and these modifications, together with other improvements and changes, are explained. To develop the hardware GRNG, the following steps have been carried out:

1. Adaptation of the segmentation algorithm to the hardware GRNG.

2. Analysis of the segmentation results and of their impact on the architecture.

3. Development of a hardware-oriented search algorithm.

4. Architecture design and general improvement of the setup stage in order to simplify the architecture.
2.5.2.2.1. Accuracy-Adaptive Segmentation Policy
Given a desired accuracy, the range of a function can be segmented in two different ways according to two objectives: to obtain the smallest number of segments, or to achieve an efficient segmentation in terms of segment search. As explained before, segment search can compromise the performance of the GRNG, given that search algorithms are usually multicycle. To avoid this problem, segmentation points can be chosen in such a way that the segment search is reduced to analyzing the value of some of the bits of the uniform variable. One example of this approach is the hierarchical segmentation used in [LCVL06]. However, selecting the segmentation points based on their search easiness has the negative effect of increasing the number of segments needed to achieve the desired accuracy, and for very high accuracies the number of segments becomes too large.

To obtain the lowest possible number of segments (and therefore the smallest tables), an accuracy-adaptive segmentation method, based on the iterative one employed in [LH02], has been used. Our method introduces several modifications to the original one. First, we have adapted the method to the uniform values that we use in the hardware architecture, 32-bit integer values (from "00000001" to "FFFFFFFF"; the zero value is not considered, as $F^{-1}(0) = -\infty$), instead of double-precision floating-point values. Hence, the segments' initial points will always match a uniform number of the hardware range, which allows a search algorithm based on integer search. Second, we have changed how the accuracy related to the interpolation error is measured: from an absolute error measured at the middle point on the uniform axis, to a relative error on the Gaussian axis using direct inversion [Mor83]. In this way, the true error is measured, not approximated as before. By using a relative error we can ensure that, for all segments, all generated floating-point Gaussian variables have the same number of accurate mantissa bits.

With this method we obtain non-uniform, non-hierarchical segments adapted to the desired accuracy (number of accurate mantissa bits).
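This criterion can be made explicit. Denoting by $\hat{x} = H_i^5(\bar{u})$ the interpolated value, the relative error and the corresponding number of accurate mantissa bits are related as follows (a sketch consistent with the policy just described):

$$\epsilon_{rel} = \left| \frac{\hat{x} - GCDF^{-1}(u)}{GCDF^{-1}(u)} \right|, \qquad \text{accurate mantissa bits} \approx \left\lfloor -\log_2 \epsilon_{rel} \right\rfloor$$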
2.5.2.2.2. Segmentation and Coefficient Analysis
The analysis of the results obtained with both the original segmentation method and ours has been carried out using double-precision floating-point arithmetic and different target error bounds. This analysis has focused on three aspects: accuracy, number of segments and the values of the coefficients.

With respect to the first two aspects, the analysis proved that the final accuracy obtained (for a reasonable number of segments) is limited by the interpolation at the boundaries of $GCDF^{-1}$, where the function tends to infinity, while its value can still be represented using single-precision floating-point arithmetic. Hence, this precision can be adopted without significant accuracy loss. At the boundaries, $GCDF^{-1}$ is very hard to interpolate and the last segments are very small, containing only one point of the possible uniform values.

Regarding the coefficient values, it has to be considered that $GCDF^{-1}$ is an odd function around 0.5, so the Gaussian random variables generated from u and 1 − u differ only in their sign. Thus, it is only necessary to segment half of $GCDF^{-1}$ to implement the inversion. The coefficient values obtained can be clearly differentiated between the segments before and after 0.5. All segments before 0.5 have negative coefficients, while among the segments after 0.5, only the segment starting at 0.5 (a single coefficient, very close to zero) and the segments at the $GCDF^{-1}$ boundary (those containing only one possible uniform value) have negative coefficients. Additionally, considering single precision, all of the coefficients are normalized numbers, and their multiplication by the minimum ū value, $2^{-32}$ (zero is not considered), also yields normalized numbers.

Negative coefficients imply subtractions in the polynomial, requiring adder/subtracters, which are more expensive than plain adders. Thus, the upper half of $GCDF^{-1}$ has been selected to implement the inversion, and its negative coefficients have been eliminated. Replacing the negative coefficient of the segment starting at 0.5 by zero has no consequences on the global accuracy (its value is very close to zero and much smaller than the other segment coefficients), while the segmentation at the boundary has been recalculated with linear interpolations including two uniform values. In this way, since $GCDF^{-1}$ is monotonically increasing, the interpolations of these segments have only positive coefficients ($a_{i0}$ and $a_{i1}$; the rest are zero), the accuracy at the boundary is improved, and the number of segments is reduced.
2.5.2.2.3. Search Unit. Search Algorithm
The search algorithm we have developed is the key factor for obtaining one sample per cycle and high throughput. Non-uniform, non-hierarchical segmentation makes some kind of hardware-adapted search algorithm necessary to overcome the multicycle search of software algorithms.

A direct adaptation of a software search algorithm would have a negative impact on the FPGA implementation because it requires multiple accesses to a search table. Handling multiple accesses requires either stalls, compromising the pipeline throughput of one sample per cycle, or the duplication of search tables (as many as the maximum number of searches needed) through different pipeline stages, increasing the resources used.

Taking as a basis the search algorithm of UNU.RAN, and taking advantage of the characteristics of today's FPGAs, whose dual-port RAM memory blocks allow the simultaneous read of two different positions, a pipelined hardware-oriented search method relying on an index table and a search table has been developed. The idea is simple: we construct the index table in such a way that, for any u, the obtained pointer into the search table always points either to the correct segment or to the one just below it. In this way, and thanks to the availability of dual-port RAMs in FPGAs, each search can be completed with just one access to the index table to obtain the pointer, plus one access to the search table reading simultaneously the segment starting points at pointer and pointer+1. A subsequent comparison of the searched value with the segment starting point at pointer+1 determines to which of the two segments the searched value belongs.

To obtain the desired index table (in the setup stage), a local search scheme based on fixed-point arithmetic and multiple local index tables was developed (the searched variables are 32-bit fixed-point values). The value range of the searched variables is divided into several parts, each one with its own local index table that ensures a single-access search. The division of the range into local tables is done in an iterative way:
1. Select a subset of bits from u that can be used as an address.

2. With that subset, build a local table until there is a value that would point to a segment that is neither the correct one nor the one below it.

3. Set the range covered so far, up to that incorrect value, as the range for that local table.

4. Update the range still to be divided by subtracting the range just set.

5. Repeat the previous steps until the whole range is indexed by local tables.
A global index table comprises all the local index tables. To obtain the correct address into the global table, a bit-comparison scheme is used. First, the input u is compared against the values that define the boundaries between the local tables. Once it is determined to which local table u belongs, the address is formed by adding the position at which that local table starts within the global one to the corresponding subset of bits of u.
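The resulting single-pass search can be sketched in C as follows. Here index_address stands for the bit-comparison addressing scheme just described, the arrays model the BRAM contents, and all names are illustrative; in hardware the two search-table reads are issued to the two BRAM ports in the same cycle.

#include <stdint.h>

extern uint32_t index_table[];     /* global index table (local tables)  */
extern uint32_t search_table[];    /* segment starting points u_i, with a
                                      final sentinel entry               */
extern uint32_t index_address(uint32_t u);  /* bit-comparison addressing */

typedef struct { uint32_t seg; uint32_t u_i; } seg_hit;

/* Single-pass search: one index-table read, one dual-port read of the
   search table at p and p+1, and one comparison to resolve the segment. */
seg_hit search_segment(uint32_t u)
{
    uint32_t p  = index_table[index_address(u)]; /* correct or previous   */
    uint32_t lo = search_table[p];               /* port A: start at p    */
    uint32_t hi = search_table[p + 1];           /* port B: start at p+1  */
    seg_hit r;
    if (u >= hi) { r.seg = p + 1; r.u_i = hi; }  /* one compare decides   */
    else         { r.seg = p;     r.u_i = lo; }
    return r;
}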
2.5.2.2.4. Architecture
The architecture needed for an inversion-based RNG is basically the same regardless of the generated distribution. Only one modification can be introduced: as explained in Section 2.5.2.2.2, when the desired distribution is an odd function with respect to its middle point, only half of the CDF$^{-1}$ has to be implemented. In this way we can halve the number of segments required and, consequently, the resources required for storing the tables.
Figure 2.7: Inversion N(0,1) RNG architecture.
Figure 2.8: Search unit.

In these cases, some logic must be introduced to transform the uniform random samples from the range (0,1) to the range [0.5,1).

Figure 2.7 shows the architecture of our N(0,1) RNG. As can be observed, the degree-5 polynomial is calculated with Horner's rule¹, reducing the calculation of the polynomial to five multiplications and additions. Apart from the polynomial calculation and the base URNG, the architecture is composed of four main units. The Fold Unit provides the extra logic to transform the uniform samples to the range [0.5,1) (bitwise negation plus '1'). The segment corresponding to the uniform variable is obtained in the Search Unit, Figure 2.8, which implements the algorithm explained in Section 2.5.2.2.3. Finally, a Format Unit transforms the fixed-point representation of ū into a tailored floating-point representation.

The use of standard floating-point arithmetic instead of fixed-point arithmetic introduces a very heavy computational overhead, see Chapter 3. However, in the implemented N(0,1) RNG, infinities, NaNs and denormalized numbers can never be produced, given the coefficients we have computed and the operations performed. Hence, the floating-point arithmetic needed is greatly simplified, as it only needs to handle normalized numbers and zeros (handling denormalized numbers accounts for most of the prenormalization and normalization logic), and the adders do not need the logic required for subtraction.
¹ $p(x) = a_0 + x(a_1 + x(a_2 + x(\cdots)))$
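The Fold Unit can be illustrated with the following C fragment, which exploits the symmetry $GCDF^{-1}(1-u) = -GCDF^{-1}(u)$. It is a sketch of the described bitwise negation plus one on the 32-bit fixed-point sample, not the literal HDL.

#include <stdint.h>

/* Map a uniform sample u in (0,1), represented as the 32-bit fixed-point
   fraction u/2^32, into the upper half [0.5,1), recording the sign of
   the final Gaussian variable. 1-u equals ~u + 1 in two's complement.  */
uint32_t fold(uint32_t u, int *sign)
{
    if (u & 0x80000000u) {   /* MSB set: u already in [0.5,1)       */
        *sign = +1;
        return u;
    }
    *sign = -1;              /* lower half: output will be negated  */
    return ~u + 1u;          /* 1 - u on 32-bit fixed point         */
}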
Table 2.5: Accuracy and segmentation.

Searched accuracy (mantissa bits)   21    31    34    44     47
Obtained accuracy (mantissa bits)   20    20    19    24     20
Segments                            185   464   779   3939   11519
Table 2.6: Maximum segment size vs. number of segments (searched accuracy: 21 mantissa bits).

Segment size (bits)   28    27    26    25    24    23
Segments              113   112   122   147   202   327
Finally, the tailored floating-point representation is also based on the arithmetic used for $\bar{u} = u - u_i$. ū is one of the operands of the five multipliers, and the way it is represented significantly affects the resources used by the pipeline and the logic needed in the multipliers. Thereby, the tailored floating-point format is selected according to the range of values of ū, whose maximum value is determined by the largest segment (equal to or smaller than the maximum segment size selected for the segmentation), while its minimum value is zero.
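As an illustration of the Format Unit, the following C sketch converts ū into a normalized floating-point value. The struct and field widths are illustrative (single-precision-like fields are assumed rather than the actual tailored format), and __builtin_clz is a GCC/Clang intrinsic standing in for the hardware priority encoder.

#include <stdint.h>

typedef struct {
    int      zero;   /* flag: only zero needs special handling       */
    int      exp;    /* unbiased exponent                            */
    uint32_t mant;   /* 23-bit mantissa field, hidden bit removed    */
} tailored_fp;

/* u_bar is a 32-bit fixed-point fraction: value = u_bar / 2^32.
   With the 24-bit maximum segment size, u_bar < 2^24 and exp <= -9. */
tailored_fp format_unit(uint32_t u_bar)
{
    tailored_fp r = { 1, 0, 0 };
    if (u_bar == 0)
        return r;                      /* zero stays a flagged zero     */
    int lz = __builtin_clz(u_bar);     /* hardware: priority encoder    */
    r.zero = 0;
    r.exp  = -1 - lz;                  /* value = 1.mant * 2^exp        */
    r.mant = (uint32_t)((((uint64_t)u_bar << (lz + 1)) >> 9) & 0x7FFFFFu);
                                       /* drop hidden bit, keep 23 bits
                                          (truncation rounding)         */
    return r;
}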
2.5.2.3. GRNG Implementation Results
The searched accuracy in the setup stage has been 21 bits of mantissa (22 bits of accuracy counting the floating-point hidden bit). This accuracy was determined by experimental trials with our modified software, but with the original maximum segment size, 0.05 in the u range; see Table 2.5, where the searched accuracy and the obtained accuracy are given in mantissa bits. The table depicts some representative accuracy results. Although the searched accuracy is increased, the results show that, while the biggest part of the range fulfills the searched accuracy, there are always several segments close to the tail where the accuracy diminishes to around 20 mantissa bits. This is due to the fact that the accuracy-adaptive segmentation algorithm measures the accuracy at the middle point of a segment, but the approximation error can be slightly bigger at points close to the middle.

According to these results, and as the accuracy obtained is of the order of single precision, we have also decided to use single-precision coefficients instead of double-precision ones, due to the impact of double precision on resources and performance in an FPGA; the accuracy obtained with them has been of the same order, 20 mantissa bits. Table 2.6 summarizes the segmentation results for 21 mantissa bits, single-precision segmentation and several maximum segment sizes, now measured in bit-widths.
To take advantage of the maximum capacity of FPGA Block RAMs (256 32-bit words for a Virtex-4), while improving the segmentation and the hardware architecture by minimizing the number of bits needed to represent ū, we have selected 24 bits as the maximum segment size. Additionally, the segmentation has been expanded to 256 segments. The extra segments come from the use of linear interpolations at the extreme of the range to ensure the accuracy in the tails.

With this segmentation, the segment search is split in two. The function range is easily approximated in most of the range, starting at 0.5, so all the first segments have the maximum segment size. Meanwhile, the last segments, the ones belonging to the tails of the function, comprise just two points with a linear interpolation. In both cases, a direct search based on some bits of the numbers is possible (forcing the first and the last segment search):

Logic search:
• Segment 0: direct search, u < 0x80000002 → address 0
• Segments 1-126: direct address → '0' & bits(30-24) + 1
• Segments 178-254: direct address → bits(8-1)
• Segment 255: forced address → address 254

Meanwhile, for the remaining range, the developed search algorithm is employed, with local index search tables of 128 positions each:
• Local table 1: segments 127-144, search bits (25-19)
• Local table 2: segments 145-160, search bits (20-14)
• Local table 3: segments 161-176, search bits (14-8)
• Local table 4: segments 177-182, search bits (9-3)
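The direct-address rules above reduce to simple bit-slice operations; as a hedged illustration in C (boundary tests between the regions omitted, u being the folded 32-bit sample):

uint32_t addr_low  = ((u >> 24) & 0x7Fu) + 1u;  /* segments 1-126:
                                                   '0' & bits(30-24), +1 */
uint32_t addr_tail = (u >> 1) & 0xFFu;          /* segments 178-254:
                                                   bits(8-1)             */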
Another inversion-based GRNG can be found in previous work [LCVL06]. That GRNG is based on a degree-two spline approximation with Chebyshev coefficients and a hierarchical segmentation of the function range. However, it is not suited for Monte Carlo simulations: it does not ensure very high accuracy, due to the use of 16-bit fixed-point arithmetic and the selected interpolation degree and coefficients. Additionally, the 16-bit fixed-point representation rules it out for a wide range of Monte Carlo applications. In Table 2.7, we reproduce the results of [LCVL06] together with those of our GRNG. [LCVL06] was implemented on a Xilinx Virtex-II XC2V4000-6, so the results of our GRNG have also been obtained for that FPGA (VX-II GRNG).

The combination of a higher interpolation degree with the use of Hermite coefficients and floating-point arithmetic allows the CDF$^{-1}$ to be approximated much more accurately: in particular, seven more bits of absolute accuracy² with respect to [LCVL06] in our worst case and, in terms of relative accuracy³, an improvement of more than five orders of magnitude.
² Maximum error value between $GCDF^{-1}$ and the polynomial interpolation.
³ Maximum percentage error in a Gaussian variable.
Table 2.7: FPGA N(0,1) RNG results.
              Slices  Block  DSP   Clock  Throughput   Max   Absolute     Relative   Sample
                      RAM    Mult  [MHz]  [MSample/s]  σ     Accuracy     Accuracy   Format
Lee [LCVL06]  585     1      4     231.0  231.0        8.2   0.3 × 2^-11  30%        fixed 16b
VX-II GRNG    1954    5      20    179.4  179.4        6.23  0.5 × 2^-18  0.000047%  32b single f.p.
VX-4 CDF^-1   1684    5      20    236.1  236.1        6.23  0.5 × 2^-18  0.000047%  32b single f.p.
VX-4 GRNG     1757    5      20    220.8  220.8        6.23  0.5 × 2^-18  0.000047%  32b single f.p.
Figure 2.9: Variance reduction general hardware architecture.

However, the improved accuracy comes at the cost of an increase in resources: the tailored single-precision floating-point arithmetic used instead of the fixed-point one, the higher interpolation degree, and the generation of 32-bit samples (24 bits of mantissa) instead of 16-bit ones are the reasons for the increase in resources with respect to [LCVL06]. In particular, the difference between interpolation degrees seriously impacts the resource usage, as each extra degree implies one more multiplier (4 DSPs), half a Block RAM and one adder.

The use of samples with more bits and the complexity of floating-point arithmetic also explain the loss of speed and throughput (around 20%). However, this is a very small performance penalty considering the accuracy improvement, as the floating-point complexity is handled efficiently with a very deeply pipelined architecture of 48 stages.
2.5.3. Stratified Sampling and Latin Hypercube
The hardware architecture devised for both Stratified Sampling and Latin Hypercube is composed of two main parts. The first part is the control unit, which controls the generation of the strata in the form of integers corresponding to each stratum, while the second part involves the arithmetic operations needed to transform uniform random variables into stratified ones. A simplified scheme of the architecture is depicted in Figure 2.9. Besides the control unit, two fixed-point operators, an adder and a divider, are needed to accomplish the stratification of the random variable:

$$u_i^s = \frac{i}{n} + \frac{u_i}{n}, \qquad i = 0, \ldots, n-1 \qquad (2.40)$$
Figure 2.10: Stratified Sampling Control Unit.
Additional logic is needed for the case of a zero being generated. To preserve the symmetry of some distributions (such as the Gaussian), and also because a zero can lead to $-\infty$ when using the inverse generation method with unbounded distributions (such as the Gaussian), the value zero has to be removed in order to obtain uniform random variables in (0,1). In this case, an alternative uniform variable belonging to the same stratum as the zero needs to be generated.

One important feature for any architecture implementing a variance reduction technique is its configurability, to allow simulations with different requirements in the number of strata, n, or the number of dimensions, d. To achieve this feature, the whole architecture is parameterizable by selecting a maximum value for both n and d.
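A C sketch of this stratification, including the zero replacement, is shown below; next_uniform is a hypothetical call to the base URNG.

#include <stdint.h>

uint32_t next_uniform(void);    /* hypothetical URNG call */

/* Stratified sample following Eq. (2.40): u_s = i/n + u/n, where the
   stratum index i in {0,...,n-1} comes from the control unit. A zero
   draw is replaced by a fresh one so the output stays in (0,1).      */
double stratify(uint32_t u, uint32_t i, uint32_t n)
{
    while (u == 0)
        u = next_uniform();                   /* Replace Unit behaviour */
    double uf = (double)u / 4294967296.0;     /* scale to (0,1)         */
    return ((double)i + uf) / (double)n;      /* i/n + u/n              */
}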
2.5.3.1. Stratified Sampling. Control
The control unit deals with the generation of the starting points of the strata. One key factor has to be considered for this generation: usually, each simulation does not use only one random variate, but rather a group of them. In this case, stratified sampling has to be applied to each of the random variates used, across a group of simulations. This fact introduces a new requirement: the generation of $u_i^s$ requires randomness in the selection of the strata starting points, in order to obtain non-ordered stratified samples. If ordered generation were used, all the random variates for the same simulation would belong to the same stratum and the results would be distorted.

A combination of a URNG with a shuffle technique over a memory table can effectively handle this requirement. While the strata starting points can be managed as a uniform distribution between 0 and n − 1, because they are later scaled by dividing by n, the shuffle technique introduces the randomness in the strata starting points. The shuffle is applied to the integer values stored in the memory table by swapping pairs of values. The randomness is introduced in the selection of one of the memory positions to be swapped, while the other one is determined by a counter that ensures that all memory positions are swapped at least once, as in the following algorithm:
for (i = n-1; i >= 0; i--) {
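    /* Hypothetical completion of the swap body, consistent with the
       description above: i is the counter-driven position, j the randomly
       selected partner, urng() a call to the uniform generator.          */
    j = urng() % n;              /* random partner position               */
    tmp      = table[i];         /* swap table[i] and table[j]            */
    table[i] = table[j];
    table[j] = tmp;
}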
[Figure: global shuffle table composed of per-dimension shuffle and read tables (dimensions 0 to dim−1), addressed by addr_random and the addr_fixed counter.]