Optimizing the TLB Shootdown Algorithm with Page Access Tracking

Total Page:16

File Type:pdf, Size:1020Kb

Optimizing the TLB Shootdown Algorithm with Page Access Tracking Optimizing the TLB Shootdown Algorithm with Page Access Tracking Nadav Amit, VMware Research https://www.usenix.org/conference/atc17/technical-sessions/presentation/amit This paper is included in the Proceedings of the 2017 USENIX Annual Technical Conference (USENIX ATC ’17). July 12–14, 2017 • Santa Clara, CA, USA ISBN 978-1-931971-38-6 Open access to the Proceedings of the 2017 USENIX Annual Technical Conference is sponsored by USENIX. Optimizing the TLB Shootdown Algorithm with Page Access Tracking Nadav Amit VMware Research Abstract their TLBs according to the information supplied by The operating system is tasked with maintaining the the initiator core, and they report back when they coherency of per-core TLBs, necessitating costly syn- are done. TLB shootdown can take microseconds, chronization operations, notably to invalidate stale causing a notable slowdown [48]. Performing TLB mappings. As core-counts increase, the overhead of shootdown in hardware, as certain CPUs do, is faster TLB synchronization likewise increases and hinders but still incurs considerable overheads [22]. scalability, whereas existing software optimizations In addition to reducing performance, shootdown that attempt to alleviate the problem (like batching) overheads can negatively affect the way applications are lacking. are constructed. Notably, to avoid shootdown la- We address this problem by revising the TLB tency, programmers are advised against using mem- synchronization subsystem. We introduce several ory mappings, against unmapping them, and even techniques that detect cases whereby soon-to-be against building multithreaded applications [28, 42]. invalidated mappings are cached by only one TLB But memory mappings are the efficient way to use or not cached at all, allowing us to entirely avoid persistent memory [18, 47], and avoiding unmap- the cost of synchronization. In contrast to existing pings might cause corruption of persistent data [12]. optimizations, our approach leverages hardware OSes try to cope with shootdown overheads by page access tracking. We implement our techniques batching them [21, 43], avoiding them on idle cores, in Linux and find that they reduce the number of or, when possible, performing them faster [5]. But TLB invalidations by up to 98% on average and thus the potential of these existing solutions is inherently improve performance by up to 78%. Evaluations limited to certain specific scenarios. To have a gener- show that while our techniques may introduce ally applicable, efficient solution, OSes need do know overheads of up to 9% when memory mappings which mappings are cached by which cores. Such in- are never removed, these overheads can be avoided formation can in principle be obtained by replicating by simple hardware enhancements. the translation data structures for each core [11], but this approach might result in significantly degraded 1. Introduction performance and wasted memory. We propose to avoid unwarranted TLB shoot- Translation lookaside buffers (TLBs) are perhaps the downs in a different manner: by monitoring access most frequently accessed caches whose coherency bits. While TLB coherency is not maintained by the is not maintained by modern CPUs. The TLB is CPU, CPU architectures can maintain the consis- tasked with caching virtual-to-physical translations tency of access bits, which are set when a mapping (“mappings”) of memory addresses, and so it is is cached. We contend that these bits can therefore accessed upon every memory read or write operation. be used to reveal which mappings are cached by Maintaining TLB coherency in hardware hampers which cores. To our knowledge, we are the first to performance [33], so CPU vendors require OSes to use access bits in this way. maintain coherency in software. But it is difficult for In the x86 architecture, which we study in this OSes to efficiently achieve this goal [27, 38, 39, 41, 48]. paper, access bit consistency is maintained by the To maintain TLB coherency, OSes employ the memory subsystem. Exploiting it, we propose tech- TLB shootdown protocol [8]. If a mapping m that niques to identify two types of common mappings possibly resides in the TLB becomes stale (due to whose shootdown can be avoided: (1) short-lived memory mapping changes) the OS flushes m from private mappings, which are only cached by a single the local TLB to restore coherency. Concurrently, core; and (2) long-lived idle mappings, which are the OS directs remote cores that might house m in reclaimed after the corresponding pages have not their TLB to do the same, by sending them an inter- been used for a while and are not cached at all. Using processor interrupt (IPI). The remote cores flush USENIX Association 2017 USENIX Annual Technical Conference 27 these techniques, we implement a fully functional x86 CPUs do not maintain coherence between the prototype in Linux 4.5. Our evaluation shows that TLB and the page-tables, nor among the TLBs of our proposal can eliminate more than 90% of TLB different cores. As a result, page-table changes may shootdowns and improve the performance of mem- leave stale entries in the TLBs until coherence is ory migration by 78%, of copy-on-write events by restored by the OS. The instruction set enables the 18–25%, and of multithreaded applications (Apache OS to do so by flushing (“invalidating”) individual and parallel bzip2) by up to 12%. PTEs or the entire TLB. Global and individual TLB Our system introduces a worst case slowdown flushes can only be performed locally, on the TLB of of up to 9% when mappings are only set and never the core that executes the flush instruction. removed or changed, which means no shootdown Although the TLB is essential to attain reasonable activity is conducted. This slowdown is caused, translation latency, some workloads experience fre- according to our measurements, due to the overhead quent TLB cache-misses [4]. Recently, new features of our TLB manipulation software techniques. To were introduced into the x86 architecture to reduce eliminate it, we propose a CPU extension that would the number and latency of TLB misses. A new instruc- allow OSes to write entries directly into the TLB, and tion set extension allows each page-table hierarchy to resembles the functionality provided by CPUs that be associated with an address-space ID (ASID) and employ software-TLB. avoid TLB flushes during address-space switching, thus reducing the number of TLB misses. Micro- 2. Background and Motivation architectural enhancements introduced page-walk caches that enable the hardware to cache internal 2.1 Memory Management Hardware nodes in the page-table hierarchy, thereby reducing Virtual memory is supported by most modern CPUs TLB-miss latencies [3]. and used by all the major OSes [9, 32]. Using vir- tual memory allows the OS to utilize the physical 2.2 TLB Software Challenges memory more efficiently and to isolate the address The x86 architecture leaves maintaining TLB co- space of each process. The CPU translates the virtual herency to the OSes, which often requires frequent addresses to physical addresses before memory ac- TLB invalidations after PTE changes. OS kernels cesses are performed. The OS sets the virtual address can make such PTE changes independently of the translations (also called “mappings”) according to running processes, upon memory migration across its policies and considerations. NUMA nodes [2], memory deduplications [49], mem- The memory mappings of each address space are ory reclamation, and memory compaction for accom- kept in a memory-resident data structure, which is modating huge pages [14]. Processes can also trigger defined by the CPU architecture. The most common PTE changes by using system calls, for example data structure, used by the x86 architecture, is a radix- mprotect, which changes protection on a memory tree, which is also known as a page-table hierarchy. range, or by writing to copy-on-write pages (COW). The leaves of the tree, called the page-table entries These PTE changes can require a TLB flush to (PTEs), hold the translations of fixed-sized virtual avoid caching of stale PTEs in the TLB. We distin- memory pages to physical frames. To translate a guish between two types of flushes: local and remote, virtual address into a physical address, the CPU in accordance with the core that initiated the PTE incorporates a memory management unit (MMU), change. Remote TLB flushes are significantly more which performs a “page table walk” on the page expensive, since most CPUs cannot flush remote table hierarchy, checking access permissions at every TLBs directly. OSes therefore perform a TLB shoot- level. During a page-walk, the MMU updates the down: The initiating core sends an inter-processor status bits in each PTE, indicating whether the page interrupt (IPI) to the remote cores and waits for was read from and/or written to (dirtied). their interrupt handlers to invalidate their TLBs and To avoid frequent page-table walks and their as- acknowledge that they are done. sociated latency, the MMU caches translations of TLB shootdowns introduce a variety of overheads. recently used pages in a translation lookaside buffer IPI delivery can take several hundreds of cycles [5]. (TLB). In the x86 architecture, these caches are main- Then, the IPI may be kept pending if the remote core tained by the hardware, bringing translations into has interrupts disabled, for instance while running the cache after page walks and evicting them accord- a device driver [13]. The x86 architecture does ing to an implementation-specific cache replacement not allow OSes to flush multiple PTEs efficiently, policy. Each x86 core holds a logically private TLB. requiring the OS to either incur the overhead of Unlike memory caches, TLBs of different CPUs are multiple flushes or flush the entire TLB and increase not maintained coherent by hardware. Specifically, the TLB miss rate. In addition, TLB flushes may 28 2017 USENIX Annual Technical Conference USENIX Association indirectly cause lock contention since they are often 2.4 Per-Core Page Tables performed while the OS holds a lock [11, 15].
Recommended publications
  • The Case for Compressed Caching in Virtual Memory Systems
    THE ADVANCED COMPUTING SYSTEMS ASSOCIATION The following paper was originally published in the Proceedings of the USENIX Annual Technical Conference Monterey, California, USA, June 6-11, 1999 The Case for Compressed Caching in Virtual Memory Systems _ Paul R. Wilson, Scott F. Kaplan, and Yannis Smaragdakis aUniversity of Texas at Austin © 1999 by The USENIX Association All Rights Reserved Rights to individual papers remain with the author or the author's employer. Permission is granted for noncommercial reproduction of the work for educational or research purposes. This copyright notice must be included in the reproduced paper. USENIX acknowledges all trademarks herein. For more information about the USENIX Association: Phone: 1 510 528 8649 FAX: 1 510 548 5738 Email: [email protected] WWW: http://www.usenix.org The Case for Compressed Caching in Virtual Memory Systems Paul R. Wilson, Scott F. Kaplan, and Yannis Smaragdakis Dept. of Computer Sciences University of Texas at Austin Austin, Texas 78751-1182 g fwilson|sfkaplan|smaragd @cs.utexas.edu http://www.cs.utexas.edu/users/oops/ Abstract In [Wil90, Wil91b] we proposed compressed caching for virtual memory—storing pages in compressed form Compressed caching uses part of the available RAM to in a main memory compression cache to reduce disk pag- hold pages in compressed form, effectively adding a new ing. Appel also promoted this idea [AL91], and it was level to the virtual memory hierarchy. This level attempts evaluated empirically by Douglis [Dou93] and by Russi- to bridge the huge performance gap between normal (un- novich and Cogswell [RC96]. Unfortunately Douglis’s compressed) RAM and disk.
    [Show full text]
  • Cygwin User's Guide
    Cygwin User’s Guide Cygwin User’s Guide ii Copyright © Cygwin authors Permission is granted to make and distribute verbatim copies of this documentation provided the copyright notice and this per- mission notice are preserved on all copies. Permission is granted to copy and distribute modified versions of this documentation under the conditions for verbatim copying, provided that the entire resulting derived work is distributed under the terms of a permission notice identical to this one. Permission is granted to copy and distribute translations of this documentation into another language, under the above conditions for modified versions, except that this permission notice may be stated in a translation approved by the Free Software Foundation. Cygwin User’s Guide iii Contents 1 Cygwin Overview 1 1.1 What is it? . .1 1.2 Quick Start Guide for those more experienced with Windows . .1 1.3 Quick Start Guide for those more experienced with UNIX . .1 1.4 Are the Cygwin tools free software? . .2 1.5 A brief history of the Cygwin project . .2 1.6 Highlights of Cygwin Functionality . .3 1.6.1 Introduction . .3 1.6.2 Permissions and Security . .3 1.6.3 File Access . .3 1.6.4 Text Mode vs. Binary Mode . .4 1.6.5 ANSI C Library . .4 1.6.6 Process Creation . .5 1.6.6.1 Problems with process creation . .5 1.6.7 Signals . .6 1.6.8 Sockets . .6 1.6.9 Select . .7 1.7 What’s new and what changed in Cygwin . .7 1.7.1 What’s new and what changed in 3.2 .
    [Show full text]
  • Extracting Compressed Pages from the Windows 10 Virtual Store WHITE PAPER | EXTRACTING COMPRESSED PAGES from the WINDOWS 10 VIRTUAL STORE 2
    white paper Extracting Compressed Pages from the Windows 10 Virtual Store WHITE PAPER | EXTRACTING COMPRESSED PAGES FROM THE WINDOWS 10 VIRTUAL STORE 2 Abstract Windows 8.1 introduced memory compression in August 2013. By the end of 2013 Linux 3.11 and OS X Mavericks leveraged compressed memory as well. Disk I/O continues to be orders of magnitude slower than RAM, whereas reading and decompressing data in RAM is fast and highly parallelizable across the system’s CPU cores, yielding a significant performance increase. However, this came at the cost of increased complexity of process memory reconstruction and thus reduced the power of popular tools such as Volatility, Rekall, and Redline. In this document we introduce a method to retrieve compressed pages from the Windows 10 Memory Manager Virtual Store, thus providing forensics and auditing tools with a way to retrieve, examine, and reconstruct memory artifacts regardless of their storage location. Introduction Windows 10 moves pages between physical memory and the hard disk or the Store Manager’s virtual store when memory is constrained. Universal Windows Platform (UWP) applications leverage the Virtual Store any time they are suspended (as is the case when minimized). When a given page is no longer in the process’s working set, the corresponding Page Table Entry (PTE) is used by the OS to specify the storage location as well as additional data that allows it to start the retrieval process. In the case of a page file, the retrieval is straightforward because both the page file index and the location of the page within the page file can be directly retrieved.
    [Show full text]
  • Memory Protection File B [RWX] File D [RW] File F [R] OS GUEST LECTURE
    11/17/2018 Operating Systems Protection Goal: Ensure data confidentiality + data integrity + systems availability Protection domain = the set of accessible objects + access rights File A [RW] File C [R] Printer 1 [W] File E [RX] Memory Protection File B [RWX] File D [RW] File F [R] OS GUEST LECTURE XIAOWAN DONG Domain 1 Domain 2 Domain 3 11/15/2018 1 2 Private Virtual Address Space Challenges of Sharing Memory Most common memory protection Difficult to share pointer-based data Process A virtual Process B virtual Private Virtual addr space mechanism in current OS structures addr space addr space Private Page table Physical addr space ◦ Data may map to different virtual addresses Each process has a private virtual in different address spaces Physical address space Physical Access addr space ◦ Set of accessible objects: virtual pages frame no Rights Segment 3 mapped … … Segment 2 ◦ Access rights: access permissions to P1 RW each virtual page Segment 1 P2 RW Segment 1 Recorded in per-process page table P3 RW ◦ Virtual-to-physical translation + P4 RW Segment 2 access permissions … … 3 4 1 11/17/2018 Challenges of Sharing Memory Challenges for Memory Sharing Potential duplicate virtual-to-physical Potential duplicate virtual-to-physical translation information for shared Process A page Physical Process B page translation information for shared memory table addr space table memory Processor ◦ Page table is per-process at page ◦ Single copy of the physical memory, multiple granularity copies of the mapping info (even if identical) TLB L1 cache
    [Show full text]
  • Paging: Smaller Tables
    20 Paging: Smaller Tables We now tackle the second problem that paging introduces: page tables are too big and thus consume too much memory. Let’s start out with a linear page table. As you might recall1, linear page tables get pretty 32 12 big. Assume again a 32-bit address space (2 bytes), with 4KB (2 byte) pages and a 4-byte page-table entry. An address space thus has roughly 232 one million virtual pages in it ( 212 ); multiply by the page-table entry size and you see that our page table is 4MB in size. Recall also: we usually have one page table for every process in the system! With a hundred active processes (not uncommon on a modern system), we will be allocating hundreds of megabytes of memory just for page tables! As a result, we are in search of some techniques to reduce this heavy burden. There are a lot of them, so let’s get going. But not before our crux: CRUX: HOW TO MAKE PAGE TABLES SMALLER? Simple array-based page tables (usually called linear page tables) are too big, taking up far too much memory on typical systems. How can we make page tables smaller? What are the key ideas? What inefficiencies arise as a result of these new data structures? 20.1 Simple Solution: Bigger Pages We could reduce the size of the page table in one simple way: use bigger pages. Take our 32-bit address space again, but this time assume 16KB pages. We would thus have an 18-bit VPN plus a 14-bit offset.
    [Show full text]
  • Unikernel Monitors: Extending Minimalism Outside of the Box
    Unikernel Monitors: Extending Minimalism Outside of the Box Dan Williams Ricardo Koller IBM T.J. Watson Research Center Abstract Recently, unikernels have emerged as an exploration of minimalist software stacks to improve the security of ap- plications in the cloud. In this paper, we propose ex- tending the notion of minimalism beyond an individual virtual machine to include the underlying monitor and the interface it exposes. We propose unikernel monitors. Each unikernel is bundled with a tiny, specialized mon- itor that only contains what the unikernel needs both in terms of interface and implementation. Unikernel mon- itors improve isolation through minimal interfaces, re- Figure 1: The unit of execution in the cloud as (a) a duce complexity, and boot unikernels quickly. Our ini- unikernel, built from only what it needs, running on a tial prototype, ukvm, is less than 5% the code size of a VM abstraction; or (b) a unikernel running on a spe- traditional monitor, and boots MirageOS unikernels in as cialized unikernel monitor implementing only what the little as 10ms (8× faster than a traditional monitor). unikernel needs. 1 Introduction the application (unikernel) and the rest of the system, as defined by the virtual hardware abstraction, minimal? Minimal software stacks are changing the way we think Can application dependencies be tracked through the in- about assembling applications for the cloud. A minimal terface and even define a minimal virtual machine mon- amount of software implies a reduced attack surface and itor (or in this case a unikernel monitor) for the applica- a better understanding of the system, leading to increased tion, thus producing a maximally isolated, minimal exe- security.
    [Show full text]
  • X86 Memory Protection and Translation
    2/5/20 COMP 790: OS Implementation COMP 790: OS Implementation Logical Diagram Binary Memory x86 Memory Protection and Threads Formats Allocators Translation User System Calls Kernel Don Porter RCU File System Networking Sync Memory Device CPU Today’s Management Drivers Scheduler Lecture Hardware Interrupts Disk Net Consistency 1 Today’s Lecture: Focus on Hardware ABI 2 1 2 COMP 790: OS Implementation COMP 790: OS Implementation Lecture Goal Undergrad Review • Understand the hardware tools available on a • What is: modern x86 processor for manipulating and – Virtual memory? protecting memory – Segmentation? • Lab 2: You will program this hardware – Paging? • Apologies: Material can be a bit dry, but important – Plus, slides will be good reference • But, cool tech tricks: – How does thread-local storage (TLS) work? – An actual (and tough) Microsoft interview question 3 4 3 4 COMP 790: OS Implementation COMP 790: OS Implementation Memory Mapping Two System Goals 1) Provide an abstraction of contiguous, isolated virtual Process 1 Process 2 memory to a program Virtual Memory Virtual Memory 2) Prevent illegal operations // Program expects (*x) – Prevent access to other application or OS memory 0x1000 Only one physical 0x1000 address 0x1000!! // to always be at – Detect failures early (e.g., segfault on address 0) // address 0x1000 – More recently, prevent exploits that try to execute int *x = 0x1000; program data 0x1000 Physical Memory 5 6 5 6 1 2/5/20 COMP 790: OS Implementation COMP 790: OS Implementation Outline x86 Processor Modes • x86
    [Show full text]
  • Demand Paging from the OS Perspective
    HW 3 due 11/10 LLeecctuturree 1122:: DDeemmaanndd PPaaggiinngg CSE 120: Principles of Operating Systems Alex C. Snoeren MMeemmoorryy MMaannaaggeemmeenntt Last lecture on memory management: Goals of memory management ◆ To provide a convenient abstraction for programming ◆ To allocate scarce memory resources among competing processes to maximize performance with minimal overhead Mechanisms ◆ Physical and virtual addressing (1) ◆ Techniques: Partitioning, paging, segmentation (1) ◆ Page table management, TLBs, VM tricks (2) Policies ◆ Page replacement algorithms (3) 2 CSE 120 – Lecture 12 LLeecctuturree OOvveerrvviieeww Review paging and page replacement Survey page replacement algorithms Discuss local vs. global replacement Discuss thrashing 3 CSE 120 – Lecture 12 LLooccaalliityty All paging schemes depend on locality ◆ Processes reference pages in localized patterns Temporal locality ◆ Locations referenced recently likely to be referenced again Spatial locality ◆ Locations near recently referenced locations are likely to be referenced soon Although the cost of paging is high, if it is infrequent enough it is acceptable ◆ Processes usually exhibit both kinds of locality during their execution, making paging practical 4 CSE 120 – Lecture 12 DDeemmaanndd PPaaggiinngg ((OOSS)) Recall demand paging from the OS perspective: ◆ Pages are evicted to disk when memory is full ◆ Pages loaded from disk when referenced again ◆ References to evicted pages cause a TLB miss » PTE was invalid, causes fault ◆ OS allocates a page frame, reads
    [Show full text]
  • Virtual Memorymemory
    ChapterChapter 9:9: VirtualVirtual MemoryMemory ChapterChapter 9:9: VirtualVirtual MemoryMemory ■ Background ■ Demand Paging ■ Process Creation ■ Page Replacement ■ Allocation of Frames ■ Thrashing ■ Demand Segmentation ■ Operating System Examples Operating System Concepts 9.2 Silberschatz, Galvin and Gagne ©2009 BackgroundBackground ■ Virtual memory – separation of user logical memory from physical memory. ● Only part of a program needs to be in memory for execution (can really execute only one instruction at a time) Only have to load code that is needed Less I/O, so potential performance gain More programs in memory, so better resource allocation and throughput ● Logical address space can therefore be much larger than physical address space. Programs can be larger than memory Less management required by programmers ● Need to allow pages to be swapped in and out ■ Virtual memory can be implemented via: ● Demand paging (for both paging and swapping) ● Demand segmentation Operating System Concepts 9.3 Silberschatz, Galvin and Gagne ©2009 VirtualVirtual MemoryMemory ThatThat isis LargerLarger ThanThan PhysicalPhysical MemoryMemory Operating System Concepts 9.4 Silberschatz, Galvin and Gagne ©2009 AA Process’sProcess’s Virtual-addressVirtual-address SpaceSpace Operating System Concepts 9.5 Silberschatz, Galvin and Gagne ©2009 VirtualVirtual MemoryMemory hashas ManyMany UsesUses ■ In addition to separating logical from physical memory, virtual memory can enable processes to share memory ■ Provides following benefits: ● Can share system
    [Show full text]
  • An Evolutionary Study of Linux Memory Management for Fun and Profit Jian Huang, Moinuddin K
    An Evolutionary Study of Linux Memory Management for Fun and Profit Jian Huang, Moinuddin K. Qureshi, and Karsten Schwan, Georgia Institute of Technology https://www.usenix.org/conference/atc16/technical-sessions/presentation/huang This paper is included in the Proceedings of the 2016 USENIX Annual Technical Conference (USENIX ATC ’16). June 22–24, 2016 • Denver, CO, USA 978-1-931971-30-0 Open access to the Proceedings of the 2016 USENIX Annual Technical Conference (USENIX ATC ’16) is sponsored by USENIX. An Evolutionary Study of inu emory anagement for Fun and rofit Jian Huang, Moinuddin K. ureshi, Karsten Schwan Georgia Institute of Technology Astract the patches committed over the last five years from 2009 to 2015. The study covers 4587 patches across Linux We present a comprehensive and uantitative study on versions from 2.6.32.1 to 4.0-rc4. We manually label the development of the Linux memory manager. The each patch after carefully checking the patch, its descrip- study examines 4587 committed patches over the last tions, and follow-up discussions posted by developers. five years (2009-2015) since Linux version 2.6.32. In- To further understand patch distribution over memory se- sights derived from this study concern the development mantics, we build a tool called MChecker to identify the process of the virtual memory system, including its patch changes to the key functions in mm. MChecker matches distribution and patterns, and techniues for memory op- the patches with the source code to track the hot func- timizations and semantics. Specifically, we find that tions that have been updated intensively.
    [Show full text]
  • Virtual Memory HW
    Administrivia • Lab 1 due Friday 12pm (noon) • We give will give short extensions to groups that run into trouble. But email us: - How much is done and left? - How much longer do you need? • Attend section Friday at 12:30pm to learn about lab 2. 1 / 37 Virtual memory • Came out of work in late 1960s by Peter Denning (lower right) - Established working set model - Led directly to virtual memory 2 / 37 Want processes to co-exist 0x9000 OS 0x7000 gcc 0x4000 bochs/pintos 0x3000 emacs 0x0000 • Consider multiprogramming on physical memory - What happens if pintos needs to expand? - If emacs needs more memory than is on the machine? - If pintos has an error and writes to address 0x7100? - When does gcc have to know it will run at 0x4000? - What if emacs isn’t using its memory? 3 / 37 Issues in sharing physical memory • Protection - A bug in one process can corrupt memory in another - Must somehow prevent process A from trashing B’s memory - Also prevent A from even observing B’s memory (ssh-agent) • Transparency - A process shouldn’t require particular physical memory bits - Yes processes often require large amounts of contiguous memory (for stack, large data structures, etc.) • Resource exhaustion - Programmers typically assume machine has “enough” memory - Sum of sizes of all processes often greater than physical memory 4 / 37 Virtual memory goals Is address No: to fault handler kernel legal? . virtual address Yes: phys. 0x30408 addr 0x92408 load MMU memory . • Give each program its own virtual address space - At runtime, Memory-Management Unit relocates each load/store - Application doesn’t see physical memory addresses • Also enforce protection - Prevent one app from messing with another’s memory • And allow programs to see more memory than exists - Somehow relocate some memory accesses to disk 5 / 37 Virtual memory goals Is address No: to fault handler kernel legal? .
    [Show full text]
  • Linear Algebra Libraries
    Linear Algebra Libraries Claire Mouton [email protected] March 2009 Contents I Requirements 3 II CPPLapack 4 III Eigen 5 IV Flens 7 V Gmm++ 9 VI GNU Scientific Library (GSL) 11 VII IT++ 12 VIII Lapack++ 13 arXiv:1103.3020v1 [cs.MS] 15 Mar 2011 IX Matrix Template Library (MTL) 14 X PETSc 16 XI Seldon 17 XII SparseLib++ 18 XIII Template Numerical Toolkit (TNT) 19 XIV Trilinos 20 1 XV uBlas 22 XVI Other Libraries 23 XVII Links and Benchmarks 25 1 Links 25 2 Benchmarks 25 2.1 Benchmarks for Linear Algebra Libraries . ....... 25 2.2 BenchmarksincludingSeldon . 26 2.2.1 BenchmarksforDenseMatrix. 26 2.2.2 BenchmarksforSparseMatrix . 29 XVIII Appendix 30 3 Flens Overloaded Operator Performance Compared to Seldon 30 4 Flens, Seldon and Trilinos Content Comparisons 32 4.1 Available Matrix Types from Blas (Flens and Seldon) . ........ 32 4.2 Available Interfaces to Blas and Lapack Routines (Flens and Seldon) . 33 4.3 Available Interfaces to Blas and Lapack Routines (Trilinos) ......... 40 5 Flens and Seldon Synoptic Comparison 41 2 Part I Requirements This document has been written to help in the choice of a linear algebra library to be included in Verdandi, a scientific library for data assimilation. The main requirements are 1. Portability: Verdandi should compile on BSD systems, Linux, MacOS, Unix and Windows. Beyond the portability itself, this often ensures that most compilers will accept Verdandi. An obvious consequence is that all dependencies of Verdandi must be portable, especially the linear algebra library. 2. High-level interface: the dependencies should be compatible with the building of the high-level interface (e.
    [Show full text]