Optimization of Monte Carlo Neutron Transport Simulations with Emerging Architectures Yunsong Wang

Total Page:16

File Type:pdf, Size:1020Kb

Optimization of Monte Carlo Neutron Transport Simulations with Emerging Architectures Yunsong Wang Optimization of Monte Carlo Neutron Transport Simulations with Emerging Architectures Yunsong Wang To cite this version: Yunsong Wang. Optimization of Monte Carlo Neutron Transport Simulations with Emerging Archi- tectures. Distributed, Parallel, and Cluster Computing [cs.DC]. Université Paris Saclay (COmUE), 2017. English. NNT : 2017SACLX090. tel-01687913 HAL Id: tel-01687913 https://pastel.archives-ouvertes.fr/tel-01687913 Submitted on 18 Jan 2018 HAL is a multi-disciplinary open access L’archive ouverte pluridisciplinaire HAL, est archive for the deposit and dissemination of sci- destinée au dépôt et à la diffusion de documents entific research documents, whether they are pub- scientifiques de niveau recherche, publiés ou non, lished or not. The documents may come from émanant des établissements d’enseignement et de teaching and research institutions in France or recherche français ou étrangers, des laboratoires abroad, or from public or private research centers. publics ou privés. NNT : 2017SACLX090 THESE` DE DOCTORAT DE L’UNIVERSITE´ PARIS-SACLAY PREPAR´ E´ A` L’ECOLE´ POLYTECHNIQUE ECOLE´ DOCTORALE No 573 INTERFACES : APPROCHES INTERDISCIPLINAIRES / FONDEMENTS, APPLICATIONS ET INNOVATION Sp´ecialit´ede doctorat: Informatique par M. Yunsong Wang Optimization of Monte Carlo Neutron Transport Simulations by Using Emerging Architectures Th`ese pr´esent´ee et soutenue `a Gif-sur-Yvette, le 14 d´ecembre 2017 : Composition de jury : M. Marc Verderi Directeur de Recherche, CNRS/IN2P3/LLR Pr´esident du jury M. Andrew Siegel Expert Senior, Argonne National Laboratory Rapporteur M. Raymond Namyst Professeur, Universit´ede Bordeaux/LABRI Rapporteur M. David Chamont Charg´ede Recherche, CNRS/IN2P3/LAL Examinateur M. David Riz Ing´enieur,CEA/DAM Examinateur M. Christophe Calvin Directeur de Recherche, CEA/DRF Directeur de th`ese M. Fausto Malvagi Expert Senior, CEA/DEN Encadrant M. Emeric Brun Ing´enieur,CEA/DEN Encadrant R´esum´e Les m´ethodes Monte Carlo (MC) sont largement utilis´eesdans le domaine du nucl´eaire pour r´esoudreles ´equationsdu transport neutronique. Ce type de m´ethode stochastique offre l’avantage de minimiser les approximations mais pr´esente un taux de convergence lent r´egitpar la loi des grands nombres, ce qui rend les simulations tr`escoˆuteuses en termes de temps de calcul. Auparavant, l’´evolution r´eguli`erede la puissance des processeurs permettait d’am´eliorerles performances du code sans avoir `areconcevoir les algorithmes. De nos jours, avec l’av`enement des architectures many-cœurs (Intel MIC) et les acc´el´erateurs mat´eriels,les travaux d’optimisation sont indispensables et sont une pr´eoccupation importante de la communaut´eMC. Dans ce type de simulation, les calculs les plus coˆuteuxconcernent le calcul des sec- tions efficaces, qui mod´elisent l’interaction probabiliste d’un neutron avec les nucl´eides qu’il rencontre sur sa trajectoire, au niveau microscopique. L’approche convention- nelle consiste `apr´e-calculer avant la simulation les sections efficaces de chaque type de nucl´eide,`achaque temp´eratureintervenant dans le syst`eme`apartir des informations des biblioth`equesde donn´eesnucl´eaires, et rang´edans des tables index´eespar l’´energie incidente. Ces donn´eessont ensuite charg´eesen m´emoire.Pendant le calcul, elles sont ensuite r´ecup´er´eesdans des tables et le caract`erestochastique du transport MC induit des acc`es`ala m´emoireal´eatoires pour un coˆutde calcul minime correspondant `aune interpolation lin´eaire. Dans ce cas, l’essentiel du temps de calcul se concentre sur des recherches dans ces tables de probabilit´es,qui sont de tailles tr`esvariables en fonction de chaque nucl´eide. Afin de minimiser les conflits d’acc`esm´emoire,nous avons ´etudi´eet optimis´eune vaste collection d’algorithmes de recherche afin d’acc´el´ererce processus de r´ecup´erationde donn´ees.Dans un premier temps, nous avons ´etudi´eet propos´eplusieurs alternatives `a la recherche binaire conventionnelle, telles que la recherche N-aire (variante vectoris´ee de la recherche binaire) et la recherche lin´eairevectoris´ee. Les ´evaluations montrent qu’une acc´el´erationsignificative peut ˆetreobtenue par rapport `ala recherche binaire conventionnelle `ala fois sur CPU classique et architecture many-cœurs. Cependant, la recherche N-aire n’est pas plus performante que la recherche binaire, ce qui indique que les efforts de vectorisation sont p´enalis´espar des acc`esdispers´eset une mauvaise utilisation du cache. La vectorisation de certains des algorithmes s’est montr´eeefficace sur l’architecture MIC grˆace`ases unit´esvectorielles de 512 bits. Cette am´elioration est moins nette sur CPU, o`ules registres vectoriels ´etant deux fois plus petits. Dans la seconde partie, nous avons ´etudi´edes multiples structures de donn´eespossibles per- mettant d’optimiser l’acc`esaux tableaux des sections efficaces. Ainsi, la grille d’´energie unifi´eeest la plus performante dans tous les cas. Ce type de m´ethode synth`etisetous les i ii tableaux en un seul, donc la recherche dans la table n’a besoin d’ˆetreeffectu´eequ’une seule fois pour chaque calcul de section efficace macroscopique. Cascade fractionnaire, un compromis similaire utilisant moins d’espace m´emoire,est ´egalement impl´ement´ee et ´evalu´ee. Les m´ethodes d’acc`espar tables de hachage, avec des strat´egiesde hachage lin´eaireset logarithmiques, pr´esente le meilleur compromis entre performance et em- preinte m´emoire.A partir de ces algorithmes existent, nous avons propos´ele N-ary map en utilisant des tampons suppl´ementaires pour assurer des acc`esm´emoireefficaces lors du chargement des bordures dans les registres vectoriels. Toutes ces variantes sont ´evalu´ees sur les syst`emesde multi-cœurs et de many-cœurs. Cependant, ils sont redoutablement inefficaces du fait de la saturation de la m´emoireet du manque de vectorisation de ces algorithmes. En d’autres termes, la puissance de calcul des architectures modernes est en grande partie gaspill´ee.Une optimisation suppl´ementaire comme la r´eductionde la m´emoires’av`eretr`esimportante car elle am´elioreen grande partie les performances in- formatiques. Un autre probl`ememajeur de ces m´ethodes est l’empreinte m´emoiretr`es importante qu’elles induisent quand un grand nombre de temp´eraturessont `aconsid´erer comme par exemple dans le cadre d’un couplage `ala thermo-hydraulique. En effet, les sections efficaces `aune temp´eraturepour les quatre cent noyaux impliqu´eesdans une simulation repr´esentent pr`esd’un gigaoctet d’espace m´emoire,ce qui rend impossible la simulation `aplusieurs centaines de temp´eratures. Afin de r´esoudrece probl`eme,nous avons ´etudi´eune approche radicalement oppos´ee : la reconstruction au vol des sections efficaces. L’id´eeest de r´ealiserle calcul des sec- tions efficaces `apartir des donn´ees ´el´ementaires se trouvant dans les biblioth`eques de donn´eesnucl´eaires (description des r´esonances) ainsi que le traitement en temp´erature (´elargissement Doppler) pendant la simulation, `achaque fois qu’une section efficace est n´ecessaire.Cet algorithme, tr`escalculatoire, repose sur une formulation similaire `acelle utilis´eedans le calcul des biblioth`equesstandards de neutronique. Cette approche per- met de passer d’un probl`emede type “memory-bound” `aun probl`emede type “compute- bound” : seules quelques variables pour chaque r´esonancesont n´ecessairesau lieu de la table conventionnelle. L’espace m´emoire est ainsi largement r´eduit,ainsi que les con- flits d’acc`es. Cette m´ethode est cependant extrˆemement coˆuteuseen temps de calcul. Apr`esune s´eried’optimisations, les r´esultatsmontrent que le noyau de reconstruction b´en´eficiede la vectorisation et peut atteindre 1,806 GFLOPS (en simple pr´ecision)sur un Knights Landing 7250, ce qui repr´esente 67% de la puissance crˆete. Cela permet d’envisager des simulations `aplusieurs centaines de temp´eratures,ce qui est impossible avec une approche classique. Mˆemesi ces efforts d’optimisation am´eliorent significativement la performance, ce calcul iii `ala vol´eereste encore beaucoup plus lent que les m´ethodes de recherche convention- nelles. Les travaux futurs devront donc se concentrer sur les optimisations algorith- miques comme des m´ethodes de fenˆetragede multipole. Si ce travail s’est concentr´esur les architectures many-cœurs, les impl´ementations sur acc´el´erateurgraphique (GPGPU) m´eritent d’ˆetre´etudi´eespour esp´ereratteindre des performances plus ´elev´ees. D’autre part, la prise en compte de l’efficacit´e´energ´etiquedevient cruciale pour ´evaluer les algo- rithmes sur les architectures modernes. Une solution de compromis entre la performance maximale et la consommation d’´energieest d´esormaisn´ecessaires.Ainsi les algorithmes propos´esdevront ˆetrer´eexamin´essous l’angle de la consommation ´energ´etiqueet pas uniquement sur leur performance brute ou leur occupation m´emoire. Acknowledgements I would firstly like to thank my thesis director, Christophe Calvin, for the opportunity he has given me as a Ph.D. student, and for his lightning-fast supports every time I was in trouble. I am grateful for his wise guidance and inclusiveness throughout my three year’s study. I acknowledge my advisor, Fausto Malvagi, for his patience with me, and his willingness to share his invaluable insights to the field. I really miss the discussions we had every Friday after lunch in his office. I would also like to thank my advisor, Emeric Brun, for the aids he offered to deal with technical issues. I am grateful to him for remembering my girlfriend’s birthday during countless ssh connections. I thank Fran¸cois-Xavier Hugot for introducing OCaml as well as his skillful codes. I want to express my gratitude to Philippe Thierry of Intel, for always providing us with the latest hardware. I would like to thank David Chamont for his role as my academical advisor. Specifically, I thank Pierre Gu´erinof EDF R&D for guiding me into the field of HPC and Monte Carlo transport. I thank the director of Maison de la Simulation, Edouard Audit, for accepting me to study at this laboratory.
Recommended publications
  • China's Progress in Semiconductor Manufacturing Equipment
    MARCH 2021 China’s Progress in Semiconductor Manufacturing Equipment Accelerants and Policy Implications CSET Policy Brief AUTHORS Will Hunt Saif M. Khan Dahlia Peterson Executive Summary China has a chip problem. It depends entirely on the United States and U.S. allies for access to advanced commercial semiconductors, which underpin all modern technologies, from smartphones to fighter jets to artificial intelligence. China’s current chip dependence allows the United States and its allies to control the export of advanced chips to Chinese state and private actors whose activities threaten human rights and international security. Chip dependence is also expensive: China currently depends on imports for most of the chips it consumes. China has therefore prioritized indigenizing advanced semiconductor manufacturing equipment (SME), which chip factories require to make leading-edge chips. But indigenizing advanced SME will be hard since Chinese firms have serious weaknesses in almost all SME sub-sectors, especially photolithography, metrology, and inspection. Meanwhile, the top global SME firms—based in the United States, Japan, and the Netherlands—enjoy wide moats of intellectual property and world- class teams of engineers, making it exceptionally difficult for newcomers to the SME industry to catch up to the leading edge. But for a country with China’s resources and political will, catching up in SME is not impossible. Whether China manages to close this gap will depend on its access to five technological accelerants: 1. Equipment components. Building advanced SME often requires access to a range of complex components, which SME firms often buy from third party suppliers and then assemble into finished SME.
    [Show full text]
  • Technology Roadmap for 22Nm CMOS and Beyond
    Technology Roadmap for 22nm CMOS and beyond June 1, 2009 IEDST 2009@IIT-Bombay Hiroshi Iwai Tokyo Institute of Technology 1 Outline 1. Scaling 2. ITRS Roadmap 3. Voltage Scaling/ Low Power and Leakage 4. SRAM Cell Scaling 5.Roadmap for further future as a personal view 2 1. Scaling 3 Scaling Method: by R. Dennard in 1974 1 Wdep: Space Charge Region (or Depletion Region) Width 1 1 SDWdep has to be suppressed 1 Otherwise, large leakage Wdep between S and D I Leakage current Potential in space charge region is high, and thus, electrons in source are 0 attracted to the space charge region. 0 V 1 K=0.7 X , Y, Z :K, V :K, Na : 1/K for By the scaling, Wdep is suppressed in proportion, example and thus, leakage can be suppressed. K Good scaled I-V characteristics K K Wdep V/Na K Wdep I I : K : K 0 0K V 4 Downscaling merit: Beautiful! Geometry & L , W g g K Scaling K : K=0.7 for example Supply voltage Tox, Vdd Id = vsatWgCo (Vg‐Vth) Co: gate C per unit area Drive current I d K –1 ‐1 ‐1 in saturation Wg (tox )(Vg‐Vth)= Wgtox (Vg‐Vth)= KK K=K Id per unit Wg Id/µm 1 Id per unit Wg = Id / Wg= 1 Gate capacitance Cg K Cg = εoεoxLgWg/tox KK/K = K Switching speed τ K τ= CgVdd/Id KK/K= K Clock frequency f 1/K f = 1/τ = 1/K Chip area Achip α α: Scaling factor In the past, α>1 for most cases Integration (# of Tr) N α/K2 N α/K2 = 1/K2 , when α=1 Power per chip P α fNCV2/2 K‐1(αK‐2)K (K1 )2= α = 1, when α=1 5 k= 0.7 and α =1 k= 0.72 =0.5 and α =1 Single MOFET Vdd 0.7 Vdd 0.5 Lg 0.7 Lg 0.5 Id 0.7 Id 0.5 Cg 0.7 Cg 0.5 P (Power)/Clock P (Power)/Clock 0.73 = 0.34 0.53 = 0.125 τ (Switching time) 0.7 τ (Switching time) 0.5 Chip N (# of Tr) 1/0.72 = 2 N (# of Tr) 1/0.52 = 4 f (Clock) 1/0.7 = 1.4 f (Clock) 1/0.5 = 2 P (Power) 1 P (Power) 1 6 - The concerns for limits of down-scaling have been announced for every generation.
    [Show full text]
  • Moore's Law: the Future of Si Microelectronics
    Moore’s law: the future of Si microelectronics Soon after Bardeen, Brattain, and Shockley invented a solid-state device in 19471 to replace electron vacuum tubes, the microelectronics industry and a revolution started. Since its birth, the industry has experienced four decades of unprecedented explosive growth driven by two factors: Noyce and Kilby inventing the planar integrated circuit2,3 and the advantageous characteristics that result from scaling (shrinking) solid-state devices. Scott E. Thompson and Srivatsan Parthasarathy SWAMP Center, Department of Electrical and Computer Engineering, University of Florida, Gainsville, FL 32611-6130 USA E-mail:[email protected], [email protected] Scaling solid-state devices has the peculiar property of improving approaches under investigation are: (1) nonclassical CMOS, which cost, performance, and power, which has historically given any consists of new channel materials and/or multigate fully depleted company with the latest technology a large competitive device structures; and (2) alternatives to CMOS, such as spintronics, advantage in the market. As a result, the microelectronics single electron devices, and molecular computing8,9. While some of industry has driven transistor feature size scaling from 10 µm to these non-Si research areas are important and will be successful in ~30 nm4-6 during the past 40 years. During most of this time, new applications and markets10, it seems unlikely any of the non-Si scaling simply consisted of reducing the feature size. However, options can replace the Si transistor for the $300 billion during certain periods, there were major changes as with the microelectronics industry in the foreseeable future (perhaps as long industry move from Si bipolar to p-channel metal-oxide- as 30 years).
    [Show full text]
  • Multiprocessing Contents
    Multiprocessing Contents 1 Multiprocessing 1 1.1 Pre-history .............................................. 1 1.2 Key topics ............................................... 1 1.2.1 Processor symmetry ...................................... 1 1.2.2 Instruction and data streams ................................. 1 1.2.3 Processor coupling ...................................... 2 1.2.4 Multiprocessor Communication Architecture ......................... 2 1.3 Flynn’s taxonomy ........................................... 2 1.3.1 SISD multiprocessing ..................................... 2 1.3.2 SIMD multiprocessing .................................... 2 1.3.3 MISD multiprocessing .................................... 3 1.3.4 MIMD multiprocessing .................................... 3 1.4 See also ................................................ 3 1.5 References ............................................... 3 2 Computer multitasking 5 2.1 Multiprogramming .......................................... 5 2.2 Cooperative multitasking ....................................... 6 2.3 Preemptive multitasking ....................................... 6 2.4 Real time ............................................... 7 2.5 Multithreading ............................................ 7 2.6 Memory protection .......................................... 7 2.7 Memory swapping .......................................... 7 2.8 Programming ............................................. 7 2.9 See also ................................................ 8 2.10 References .............................................
    [Show full text]
  • Design and Analysis of a New Carbon Nanotube Full Adder Cell
    Hindawi Publishing Corporation Journal of Nanomaterials Volume 2011, Article ID 906237, 6 pages doi:10.1155/2011/906237 Research Article Design and Analysis of a New Carbon Nanotube Full Adder Cell M. H. Ghadiry,1 Asrulnizam Abd Manaf,1 M. T. Ahmadi,2 Hatef Sadeghi,2 and M. Nadi Senejani3 1 School of Electrical and Electronic Engineering, Universiti Sains Malaysia, Engineering Campus, 11800 Penang, Malaysia 2 Faculty of Electrical Engineering, Universiti Teknologi Malaysia, 81310 Skudai, Malaysia 3 Department of Computer Engineering, Islamic Azad University, Ashtian Branch, 39618-13347 Ashtian, Iran Correspondence should be addressed to M. H. Ghadiry, [email protected] Received 10 January 2011; Accepted 27 February 2011 Academic Editor: Theodorian Borca-Tasciuc Copyright © 2011 M. H. Ghadiry et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. A novel full adder circuit is presented. The main aim is to reduce power delay product (PDP) in the presented full adder cell. A new method is used in order to design a full-swing full adder cell with low number of transistors. The proposed full adder is implemented in MOSFET-like carbon nanotube technology and the layout is provided based on standard 32 nm technology from MOSIS. The simulation results using HSPICE show that there are substantial improvements in both power and performance of the proposed circuit compared to the latest designs. In addition, the proposed circuit has been implemented in conventional 32 nm process to compare the benefits of using MOSFET-like carbon nanotubes in arithmetic circuits over conventional CMOS technology.
    [Show full text]
  • Challenges and Innovations in Nano‐CMOS Transistor Scaling
    Challenges and Innovations in Nano‐CMOS Transistor Scaling Tahir Ghani Intel Fellow Logic Technology Development October, 2009 Nikkei Presentation 1 Outline • Traditional‐Scaling ‐ Traditional Scaling Limiters, Device Implications ‐ Intel’s Response • Post “Traditional‐Scaling” Innovations ‐ Mobility Booster: Uniaxial Strain - Poly Depletion Elimination: Metal Gate - Gate Leakage Reduction: HiK • Future Challenges and Options - Power Limitation - Potential New Transistor Structures and Materials 40+ Years of Moore’s Law at INTEL: From Few to Billions of Transistors 2X transistors every 2 years Transistor Count has Doubled Every Two Years 40+ Years of Moore’s Law at INTEL: From Few to Billions of Transistors 2X transistors every 2 years Traditional Scaling Era END OF TRADITIONAL SCALING ERA ~ 2003 Lasted ~40 YEARS Top “Traditional-Scaling” Enablers R. Dennard et.al. IEEE JSSC, 1974 • Gate Oxide Thickness Scaling - Key enabler for Lgate scaling • Junction Scaling - Another enabler for Lgate scaling - Improved abruptness (REXT reduction) • Vcc Scaling - Reduce XDEP (improve SCE) - However, did not follow const E field 1990’s: Golden Era of Scaling Vcc, Tox & Lg scaling & increasing Idsat Year 2000: INTEL 90nm CMOS Pathfinding End of “Traditional-Scaling” Era Gate oxide running Mobility degrades out of atoms with scaling 1.E+04 Jox limit VLSI Symp. 2000 300 Universal 1.E+03 NA= Mobility 3x1017 ] 1.E+02 SiO 250 2 2 /(V.s) [Lo et. al, EDL97] 2 1.E+01 18 200 1.3x10 [A/cm 130nm 1.E+00 1.8x1018 OX J 1.E-01 18 150 2.5x10 18 1.E-02 Nitrided SiO2 180nm 3.3x10 1.E-03 Mobility (cm 100 0.51.01.52.02.5 0 0.5 1 1.5 TOX Physical [nm] E EFF [MV/cm] • Gate Oxide Leakage • Universal Mobility Model direct tunneling limited • Ionized impurity scattering T.
    [Show full text]
  • Prospects for the Application of Nanotechnologies to the Computer System Architecture
    JOURNAL OFNANO- AND ELECTRONICPHYSICS ЖУРНАЛ НАНО- ТА ЕЛЕКТРОННОЇ ФІЗИКИ Vol. 4 No 1, 01003(6pp) (2012) Том 4 № 1, 01003(6cc) (2012) Prospects for the Application of Nanotechnologies to the Computer System Architecture J. Partyka1, M. Mazur2,* 1 Faculty of Electrical Engineering and Information Science, Lublin University of Technology 38а, ul. Nadbystrzycka 2 Wyższa Szkoła Ekonomii i Innowacji w Lublinie ul. Mełgiewska 7-9 (Received 26 September 2011; published online 14 March 2012) Computer system architecture essentially influences the comfort of our everyday living. Developmental transition from electromechanical relays to vacuum tubes, from transistors to integrated circuits has sig- nificantly changed technological standards for the architecture of computer systems. Contemporary infor- mation technologies offer huge potential concerning miniaturization of electronic circuits. Presently, a modern integrated circuit includes over a billion of transistors, each of them smaller than 100 nm . Step- ping beyond the symbolic 100 nm limit means that with the onset of the 21 century we have entered a new scientific area that is an era of nanotechnologies. Along with the reduction of transistor dimensions their operation speed and efficiency grow. However, the hitherto observed developmental path of classical elec- tronics with its focus on the miniaturization of transistors and memory cells seems arriving at the limits of technological possibilities because of technical problems as well as physical limitations related to the ap- pearance of
    [Show full text]
  • Si MOSFET Roadmap for 22Nm and Beyond
    Si MOSFET Roadmap for 22nm and beyond December 16, 2009 Jadavpur University @ Kolkata, India Hiroshi Iwai Tokyo Institute of Technology 1 Outline 1. Scaling 2. ITRS Roadmap 3. Voltage Scaling/ Low Power and Leakage 4. SRAM Cell Scaling 5.Roadmap for further future 2 1. Scaling 3 Scaling Method: by R. Dennard in 1974 1 Wdep: Space Charge Region (or Depletion Region) Width 1 1 SDWdep has to be suppressed 1 Otherwise, large leakage Wdep between S and D I Leakage current Potential in space charge region is high, and thus, electrons in source are 0 attracted to the space charge region. 0 V 1 K=0.7 X , Y, Z :K, V :K, Na : 1/K for By the scaling, Wdep is suppressed in proportion, example and thus, leakage can be suppressed. K Good scaled I-V characteristics K K Wdep V/Na K Wdep I I : K : K 0 0K V 4 Downscaling merit: Beautiful! Geometry & L , W g g K Scaling K : K=0.7 for example Supply voltage Tox, Vdd Id = vsatWgCo (Vg‐Vth) Co: gate C per unit area Drive current I d K –1 ‐1 ‐1 in saturation Wg (tox )(Vg‐Vth)= Wgtox (Vg‐Vth)= KK K=K Id per unit Wg Id/µm 1 Id per unit Wg = Id / Wg= 1 Gate capacitance Cg K Cg = εoεoxLgWg/tox KK/K = K Switching speed τ K τ= CgVdd/Id KK/K= K Clock frequency f 1/K f = 1/τ = 1/K Chip area Achip α α: Scaling factor In the past, α>1 for most cases Integration (# of Tr) N α/K2 N α/K2 = 1/K2 , when α=1 Power per chip P α fNCV2/2 K‐1(αK‐2)K (K1 )2= α = 1, when α=1 5 2 Generations k= 0.72 =0.5 and α =1 Single MOFET Vdd 0.5 Lg 0.5 Id 0.5 Cg 0.5 P (Power)/Clock 0.53 = 0.125 τ (Switching time) 0.5 Chip N (# of Tr) 1/0.52 = 4 f (Clock) 1/0.5 = 2 P (Power) 1 6 - The concerns for limits of down-scaling have been announced for every generation.
    [Show full text]
  • Parallel Computing: Multithreading and Multicore We Demand Increasing Performance for Desktop Applications
    The Problem Parallel Computing: MultiThreading and MultiCore We demand increasing performance for desktop applications. How can we get that? There are four approaches that we’re going to discuss here: Mike Bailey 1. We can increase the clock speed (the “La-Z-Boy approach”). [email protected] 2. We can combine several separate computer systems, all working together Oregon State University (multiprocessing). 3. We can develop a single chip which contains multiple CPUs on it (multicore). 4. We can look at where the CPU is spending time waiting, and give it something else to do while it’s waiting (multithreading). "If you were plowing a field, which would you rather use – two strong oxen or 1024 chickens?" -- Seymore Cray Oregon State University Oregon State University Computer Graphics Computer Graphics mjb – November 13, 2012 mjb – November 13, 2012 1. Increasing Clock Speed -- Moore’s Law Moore’s Law “Transistor density doubles every 1.5 years.” From 1986 to 2002, processor performance increased an average of 52%/year Fabrication process sizes (“gate pitch”) have fallen from 65 nm to 45 nm to 32 nm to 22 nm Next will be 16 nm ! Note that oftentimes people ( incorrectly ) equivalence this to: “Clock speed doubles every 1.5 years.” Oregon State University Oregon State University Computer Graphics Source: http://www.intel.com/technology/mooreslaw/index.htm Computer Graphics mjb – November 13, 2012 mjb – November 13, 2012 Moore’s Law Clock Speed and Power Consumption From 1986 to 2002, processor performance increased an average of 52%/year which 1981 IBM PC 5 MHz means that it didn’t quite double every 1.5 years, but it did go up by 1.87, which is close.
    [Show full text]
  • Roadmap for 22Nm Logic CMOS and Beyond
    Roadmap for 22nm Logic CMOS and Beyond March 9, 2009 @Heritage Institute of Technology Hiroshi Iwai, Tokyo Institute of Technology 1 • There were many inventions in the 20th century: Airplane, Nuclear Power generation, Computer, Space aircraft, etc • However, everything has to be controlled by electronics • Electronics Most important invention in the 20th century • What is Electronics: To use electrons, Electronic Circuits Electronic Circuits started by the invention of vacuum tube (Triode) in 1906 Thermal electrons from cathode controlled by grid bias Lee De Forest Cathode Anode (heated) Grid (Positive bias) Same mechanism as that of transistor 4 wives of Lee De Forest 1906 Lucille Sheardown 1907 Nora Blatch 1912 Mary Mayo, singer 1930 Marie Mosquini, silent film actress Mary Marie 4 First Computer Eniac: made of huge number of vacuum tubes 1946 Big size, huge power, short life time filament Æ dreamed of replacing vacuum tube with solid‐state device Today's pocket PC made of semiconductor has much higher performance with extremely low power consumption 5 History of Semiconductor devices 1947, 1st Point Contact Bipolar Transistor: Ge Semiconductor, Bardeen, Brattin Æ Nobel Prize 1948, 1st Junction Bipolar Transistor, Ge Semiconductor, Schokley Æ Nobel Prize 1958, 1st Integrated Circuits, Ge Semiconductor, J.Kilby Æ Nobel Prize 1959, 1st Planar Integrated Circuits, R.Noice 1960, 1st MOS Transistor, Kahng, Si Semiconductor 1963, 1st CMOS Circuits, C.T. Sah and F. Wanlass 6 J. E. LILIENFELD DEVICES FOR CONTROLLED ELECTRIC CURRENT Filed March 28, 1928 J.E.LILIENFELD 7 Capacitor structure with notch Negative bias Gate Electrode Gate Insulator Semiconductor Electron No current Positive bias Electric field Current flows 8 G Surface Gate electrode Gate Oxd Channel Drain Source SD Electron flow 0 bias for gate Positive bias for gate Surface Potential (Negative direction) Negative 0V 0V N+-Si P-Si N+-Si P-Si 1V 1V N-Si N-Si Source Channel Drain Source Channel Drain 9 However, no one could realize MOSFET operation for more than 30 years.
    [Show full text]
  • Power and Performance Optimization for Network-On-Chip Based Many-Core Processors
    Power and Performance Optimization for Network-on-Chip based Many-Core Processors YUAN YAO Doctoral Thesis in Information and Communication Technology (INFKOMTE) School of Electrical Engineering and Computer Science KTH Royal Institute of Technology Stockholm, Sweden 2019 KTH School of Electrical Engineering and Computer Science TRITA-EECS-AVL-2019:44 Electrum 229, SE-164 40 Stockholm ISBN 978-91-7873-182-4 SWEDEN Akademisk avhandling som med tillstånd av Kungl Tekniska högskolan framlägges till offentlig granskning för avläggande av teknologie doktorsexamen i Informations- och Kommunikationsteknik fredag den 23 Augusti 2019 klockan 9.00 i Sal B, Elect- rum, Kungl Tekniska högskolan, Kistagången 16, Kista. © Yuan Yao, May 27, 2019 Tryck: Universitetsservice US AB iii Abstract Network-on-Chip (NoC) is emerging as a critical shared architecture for CMPs (Chip Multi-/Many-Core Processors) running parallel and concurrent applications. As the core count scales up and the transistor size shrinks, how to optimize power and performance for NoC open new research challenges. As it can potentially consume 20–40% of the entire chip power [20, 40, 81], NoC power efficiency has emerged as one of the main design constraints in today’s and future high performance CMPs. For NoC power management, we propose a novel on-chip DVFS technique that is able to adjust per-region NoC V/F according to voted V/F levels from communicating threads. A thread periodically votes for a preferred NoC V/F level that best suits its individual performance interests. The final DVFS decision of each region is adjusted by a region DVFS controller democratically based on the majority of votes it receives.
    [Show full text]
  • VLSI Trends a Brief History
    VLSI Trends A Brief History 1958: First integrated circuit Flip-flop using two transistors From Texas Instruments 2011 Courtesy Texas Instruments Intel 10 Core Xeon Westmere-EX 2.6 billion transistors 32 nm process Courtesy Intel Moore’s Law Growth rate 2x transistors & clock speeds every 2 years over 50 years 10x every 6-7 years Dramatically more complex algorithms previously not feasible Dramatically more realistic video games and graphics animation (e.g. Playstation 4, Xbox 360 Kinect, Nintendo Wii) 1 Mb/s DSL to 10 Mb/s Cable to 2.4 Gb/s Fiber to Homes 2G to 3G to 4G wireless communications MPEG-1 to MPEG-2 to MPEG-4 to H.264 video compression 480 x 270 (0.13 million pixels) NTSC to 1920x1080 (2 million pixels) HDTV resolution Moore’s Law Moore’s Law Many other factors grow exponentially Ex: clock frequency, processor performance Standard Cells NOR-3 XOR-2 Standard Cell Layout GeForce 8800 (600+ million transistors, about 60+ million gates) Subwavelength Lithography Challenges Source: Raul Camposano, 2003 NRE Mask Costs Source: MIT Lincoln Labs, M. Fritze, October 2002 ASIC NRE Costs Not Justified for Many Applications Forecast: By 2010, a complex ASIC will have an NRE Cost of over $40M = $28M (NRE Design Cost) + $12M (NRE Mask Cost) Many “ASIC” applications will not have the volume to justify a $40M NRE cost e.g. a $30 IC with a 33% margin would require sales of 4M units (x $10 profit/IC) just to recoup $40M NRE Cost Power Density a Key Issue Motivated mainly by power limits Ptotal = Pdynamic + Pleakage 2 Pdynamic = ½ a C VDD f Problem: power (heat dissipation) density has been growing exponentially because clock frequency (f) and transistor count have been doubling every 2 years Power Density a Key Issue Intel VP Patrick Gelsinger (ISSCC 2001) “If scaling continues at present pace, by 2005, high speed processors would have power density of nuclear reactor, by 2010, a rocket nozzle, and by 2015, surface of sun.” Courtesy Intel Before Multicore Processors e.g.
    [Show full text]