Jennifer Moore: Pipelining in the Xeon Phi Processor

Jennifer Moore

Pipeline

Pipelining is a technique used in the Xeon Phi processor that breaks the fetch and execute cycle into several steps. Today's computers process faster by utilizing pipelining, which makes it possible to work on multiple instructions simultaneously. Pipelining can be described as the basic path an instruction takes through the design of any computer. Shantanu Dutt explains a pipeline as a "concept in which the entire processing flow is broken up into multiple stages, and a new data/instruction is processed by a stage potentially as soon as it is done with the current data/instruction, which then goes onto the next stage for further processing" (Dutt, 2001). Compared to the Very Long Instruction Word (VLIW) approach, pipelining is similar in its use of parallelism: different steps work together in parallel, so more than one instruction can be in progress at the same time. These stages describe the steps required to perform a fetch and execute cycle. For example, each stage is comparable to one of the instructions given to the Little Man Computer (LMC) so it can execute an operation. Each stage is timed, and a stage can introduce a delay depending on the result of the stage before it (Englander, 2009). Pipelining is similar to the LMC in this respect; however, the LMC uses operation codes, or an instruction set, to determine the outcome of a number input by the user, whereas a pipeline uses its stages to move instructions and data along a timed path (Englander, 2009).
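To make the timing benefit concrete, here is a minimal Python sketch (the stage count and instruction count are invented for illustration) comparing how many clock cycles a run of instructions takes with and without an ideal three-stage pipeline:

```python
# Minimal pipeline timing sketch (hypothetical 3-stage pipe: fetch, decode, execute).
# Without pipelining, each instruction uses every stage before the next one starts.
# With pipelining, a new instruction enters the fetch stage every cycle.

NUM_STAGES = 3          # fetch, decode, execute (assumed)
NUM_INSTRUCTIONS = 100

# Sequential: every instruction occupies the whole datapath by itself.
sequential_cycles = NUM_INSTRUCTIONS * NUM_STAGES

# Pipelined (ideal, no stalls): the first instruction takes NUM_STAGES cycles
# to fill the pipe; after that, one instruction completes every cycle.
pipelined_cycles = NUM_STAGES + (NUM_INSTRUCTIONS - 1)

print(f"sequential: {sequential_cycles} cycles")   # 300
print(f"pipelined:  {pipelined_cycles} cycles")    # 102
```

The pipelined total assumes no stalls; the stage-dependent delays mentioned above would add cycles to it.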
Disk Cache

A disk cache is made up of main memory or of memory integrated within most new disk drives. The disk cache makes it possible to access information from the disk faster by storing frequently used data in temporary memory where it is promptly accessible. Englander explains that "when a disk read or write request is made, the system checks the disk cache first. If the required data is present, no disk access is necessary; otherwise, a disk cache line made up of several adjoining disk blocks is moved from the disk into the disk cache area of memory" (Englander, 2009). Caching allows the system to temporarily store commonly used data where it can be retrieved quickly without accessing the disk. The diagram below (similar to the one in the Vickovic, Celar, and Mudnic article) shows the server accessing data from the disk cache: if the server finds the requested data in the disk cache, it does not have to access the disk. When data is accepted and stored, Vickovic, Celar, and Mudnic explain that "when the request is stored, the amount of free space on Disk Cache is decreased and it is pushed on cache queue" (Vickovic, Celar & Mudnic, 2011). Data in the disk cache used with the Xeon Phi processor can be transmitted faster than data read directly from the drive itself (Vickovic, Celar & Mudnic, 2011).
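As a rough illustration of the check-the-cache-first behavior Englander describes, here is a small Python sketch; the line size and function names are hypothetical, since a real disk cache lives in the operating system or drive firmware:

```python
# Toy disk cache: a dict of cached blocks keyed by block number.
# On a read, check the cache first; only touch the (slow) disk on a miss,
# and then pull in a whole line of adjoining blocks, as Englander describes.

LINE_SIZE = 4          # blocks per cache line (assumed value)
cache = {}             # block number -> block data

def read_disk_block(block):
    # Stand-in for a slow physical disk access.
    return f"<data for block {block}>"

def read_block(block):
    if block in cache:                      # cache hit: no disk access needed
        return cache[block]
    line_start = (block // LINE_SIZE) * LINE_SIZE
    for b in range(line_start, line_start + LINE_SIZE):
        cache[b] = read_disk_block(b)       # miss: fetch a line of adjoining blocks
    return cache[block]

print(read_block(10))   # miss: loads blocks 8-11 from "disk"
print(read_block(11))   # hit: served straight from the cache
```

A fuller model would also track free space and evict old lines from a cache queue, as the Vickovic, Celar, and Mudnic quote describes.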
Very Long Instruction Word (VLIW)

A very long instruction word is a component used in the Xeon Phi processor that provides instructions for programs to perform efficiently. According to the Englander text, the main purpose of this architecture is "to increase execution speed by processing instruction operations in parallel" (Englander, 2009). A user working on a high-speed program would rely on VLIW. A VLIW design consists of numerous execution units that enhance the processing of a program so it runs faster. Binu Mathew describes VLIW as "one particular style of processor design that tries to achieve high levels of parallelism by executing long instruction words composed of multiple operations" (Philips, 2008). To have a CPU that runs programs quickly and efficiently, one would want one built around a very long instruction word design. VLIW can be illustrated by the Transmeta Crusoe, a processor design whose instructions Englander describes as "a 128 bit instruction word called a molecule. The molecule is divided into four 32-bit atoms. Each atom represents an operation similar to those of a normal 32-bit instruction word" (Englander, 2009). The diagram below demonstrates the 128-bit instruction Englander explains in the text. Compared to the LMC, both perform a fetch and execute cycle, and each can add, load, branch on condition, and store numbers. The four atoms supply the four operations used in the instruction word, and they collaborate to complete the execution cycle. By using parallelism, two cycles can work simultaneously (Englander, 2009).
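To illustrate the molecule/atom structure Englander describes, this short Python sketch splits a 128-bit word into four 32-bit atoms; the example value is arbitrary, and real Crusoe atom encodings are not modeled:

```python
# Split a 128-bit Crusoe-style "molecule" into four 32-bit "atoms".
# The molecule value here is made up purely for illustration.

molecule = 0x11111111_22222222_33333333_44444444   # one 128-bit instruction word

# Extract the four 32-bit fields, highest-order atom first.
atoms = [(molecule >> shift) & 0xFFFFFFFF for shift in (96, 64, 32, 0)]

for i, atom in enumerate(atoms):
    # Each atom is roughly comparable to a normal 32-bit instruction word,
    # and all four can be issued to different functional units in parallel.
    print(f"atom {i}: 0x{atom:08X}")
```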
References

Pipeline

Englander, I. (2009). The architecture of computer hardware, systems software, & networking (4th ed., p. 253). Wiley.

Dutt, S. (2001). Pipeline basics: Lecture notes #14. Retrieved from http://www.ece.uic.edu/~dutt/courses/ece366/lect-notes.html

Disk Cache

Vickovic, L., Celar, S., & Mudnic, E. (2011). Disk array simulation model development. Retrieved from http://ehis.ebscohost.com.proxygsu-sct1.galileo.usg.edu/eds/pdfviewer/pdfviewer?sid=1eb5e2fd-ad53-4fad-8dc2-8357a74e92b8@sessionmgr14&vid=6&hid=101

Englander, I. (2009). The architecture of computer hardware, systems software, & networking (4th ed., p. 263). Wiley.

Very Long Instruction Word

Englander, I. (2009). The architecture of computer hardware, systems software, & networking (4th ed., p. 244). Wiley.

Philips. (2008). An introduction to very-long instruction word (VLIW) computer architecture. Retrieved from http://twins.ee.nctu.edu.tw/courses/ca_08/literature/11_vliw.pdf

IT5200 Kornchai Anujhun

Ring Bus

A ring bus is a substation switching arrangement that may consist of four, six, or more breakers connected in a closed loop, with the same number of connection points. Figure 1 depicts the layout of a ring bus configuration, which is an extension of the sectionalized bus. In the ring bus, a sectionalizing breaker has been added between the two open bus ends. In other words, the bus forms a closed loop, with each section separated by a circuit breaker. This provides greater reliability and allows for flexible operation.

Figure 1: Ring bus
Figure 2: 4-breaker ring bus in an ATI graphics card

USB

Universal Serial Bus, also known as USB, is a standard type of connection for many different kinds of devices. Generally, USB refers to the types of cables and connectors used to connect these many types of external devices to computers. The Universal Serial Bus standard has been extremely successful. USB ports and cables are used to connect hardware such as printers, scanners, keyboards, mice, flash drives, external hard drives, joysticks, cameras, and more to computers of all kinds, including desktops, tablets, and laptops. In fact, USB has become so common that you'll find the connection available on nearly any computer-like device, such as video game consoles, home audio/visual equipment, and even many automobiles. Many portable devices, like smartphones, eBook readers, and small tablets, use USB primarily for charging. USB charging has become so common that it's now easy to find replacement electrical outlets at home improvement stores with USB ports built in, negating the need for a USB power adapter.

Figure 3: USB connection
Memory Interleaving

Memory interleaving is a method used to increase the speed of high-end microprocessors. It is a memory access technique that divides system memory into a series of equal-sized banks. These banks are described as n-way interleaved: 2-way interleaving uses two complete address buses, 4-way interleaving uses four, and 8-way interleaving uses eight. While one bank is busy operating on a word at a particular location, another bank accesses the word at the next location.

Figure 4: 2-way interleaved memory

In a 2-way interleaved memory system there are two physical banks of DRAM, but logically the system sees one bank of memory that is twice as large. In the interleaved bank, the first long word of bank 0 is followed by the first long word of bank 1, which is followed by the second long word of bank 0, which is followed by the second long word of bank 1, and so on. Figure 4 shows this organization for two physical banks of N long words. All even long words of the logical bank are located in physical bank 0 and all odd long words are located in physical bank 1.
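The even/odd bank assignment described above amounts to using the low bits of the long-word address to select a bank. A small Python sketch (illustrative only) of that mapping:

```python
# Map logical long-word addresses onto N interleaved banks.
# With 2-way interleaving the bank is just the low address bit, so
# consecutive words fall in alternating banks and can be accessed in parallel.

def interleave(address, ways=2):
    bank = address % ways        # which physical bank holds this word
    offset = address // ways     # position of the word within that bank
    return bank, offset

for addr in range(8):
    bank, offset = interleave(addr, ways=2)
    print(f"word {addr}: bank {bank}, offset {offset}")
# word 0 -> bank 0, word 1 -> bank 1, word 2 -> bank 0, and so on.
```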
References

Schuette, M. (2011, January 2). Intel's Sandy Bridge I: Architecture & CPU performance. One ring bus to master them all. Retrieved from http://www.lostcircuits.com/mambo//index.php?option=com_content&task=view&id=98&Itemid=1&limit=1&limitstart=6

Wikipedia. (2011, December). Network topology. Retrieved from http://en.wikipedia.org/wiki/Network_topology

Shimpi, A. (2010, September 14). Intel's Sandy Bridge architecture exposed. Retrieved from http://www.anandtech.com/show/3922/intels-sandy-bridge-architecture-exposed/4

Wikipedia. (2012, December). Universal Serial Bus. Retrieved from http://en.wikipedia.org/wiki/Universal_Serial_Bus

PCMag. USB definition. Retrieved from http://www.pcmag.com/encyclopedia_term/0,2542,t%3DUSB&i%3D53531,00.asp

Matloff, N. (2003, November 5). Memory interleaving. Retrieved from http://heather.cs.ucdavis.edu/~matloff/154A/PLN/Interleaving.pdf

ORNL Physics Division. Interleaved memory. Retrieved from http://www.phy.ornl.gov/csep/ca/node19.html

Execution Unit

An execution unit, also called a functional unit, is a part of the CPU that performs the operations and calculations called for by the branch unit, which receives data from the CPU. It may have its own internal control sequence unit, some registers, and other internal units such as a sub-ALU.