P11MCA1 & P8MCA1 - ADVANCED COMPUTER ARCHITECTURE

UNIT V PROCESSORS AND MEMORY HIERARCHY

ADVANCED PROCESSOR TECHNOLOGY:

1. Name the major processor families.
The major processor families are CISC, RISC, superscalar, VLIW, super-pipelined, vector and symbolic processors.

2. What are the uses of scalar and vector processors?
Scalar and vector processors are used for numerical computation.

3. What are the uses of symbolic processors?
Symbolic processors are used for Artificial Intelligence applications.

4. What is the recent trend in processor technology?
Clock rates of processors are gradually increasing because of improvements in implementation technology, and CPI rates are slowly decreasing due to improvements in hardware and software technology.

5. Compare the performance and features of the various processor families based on their clock rates and CPI rates.

S.No | Processor family | Features | Clock rate (MHz) | CPI
1 | Complex Instruction Set Computers (CISC) | Large, complex instruction set | 33 - 50 | 1 - 20
2 | Reduced Instruction Set Computers (RISC) | Hardwired control; small number of simple instructions | 20 - 120 | 1 - 2
3 | Superscalar processors | Multiple instructions issued at the same time | 20 - 120 | < 1 - 2
4 | Very Long Instruction Word (VLIW) processors | More functional units than superscalars; 256 - 1024 bits per instruction; ROM-based micro-programmed control | Slow, as instructions must be brought from ROM | High (more access cycles for some instructions)
5 | Super-pipelined processors | Use multi-phase clocks | 100 - 500 | High; low if multi-instruction issue is practiced
6 | Vector supercomputers | Super-pipelined; multiple functional units for scalar and vector operations; very costly | - | Low

(Refer figure above)

6. Write a note on instruction pipelines.
The execution cycle of a typical instruction consists of four phases: fetch, decode, execute and write-back. These phases are executed by an instruction pipeline, which receives successive instructions at its input end and executes them in a streamlined, overlapped fashion as they flow through.
A pipeline cycle is defined as the time required for each phase to complete its operation (per instruction), assuming equal delay in all phases. For full utilization of the instruction pipeline, the instruction issue rate should be 1; otherwise the pipeline is under-utilized. Another cause of pipeline under-utilization is the combining of pipeline stages, e.g. fetch with decode, or execute with write-back. In such cases the clock rate of the pipeline is halved.

7. Define instruction pipeline cycle.
The instruction pipeline cycle is the clock period of the instruction pipeline.

8. Define instruction issue latency.
Instruction issue latency is the time (in cycles) required between the issuing of two adjacent instructions.

9. Define instruction issue rate.
The instruction issue rate is the number of instructions issued per cycle. It is also called the degree of a superscalar processor.

10. Define simple operation latency.
Simple operation latency is the time taken, in cycles, to execute simple instructions (e.g. integer add, load, store).

11. Define complex operation latency.
Complex operation latency is the time taken, in cycles, to execute complex instructions (e.g. divide, cache misses).

12. Define resource conflict.
A resource conflict is the situation where two or more instructions demand use of the same functional unit at the same time.

13. What is a base scalar processor?
A base scalar processor is a machine with one instruction issued per cycle, a one-cycle latency for a simple operation and a one-cycle latency between instruction issues.

14. Write short notes on processors and co-processors.
The CPU is a scalar processor. It may contain multiple functional units such as an integer ALU and a floating point accelerator. Alternatively, the special functional unit can be built into a co-processor attached to the CPU. The co-processor executes instructions dispatched by the CPU, but it does not handle I/O operations. Eg.
A floating point accelerator executing scalar data, a vector processor executing vector operands, a digital signal processor, or a Lisp processor for AI applications. Co-processors are used to speed up numerical computations. They cannot be used alone: they act as a back-end to the central CPU and must be compatible with it. They are also called attached processors or slave processors. Some of them may be more powerful than their hosts, e.g. the Cray Y-MP attached to a minicomputer.
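The pipeline timing defined in questions 6-13 can be made concrete with a small sketch. This is an illustrative model only, assuming the four-phase pipeline, equal stage delays and an ideal issue rate of one instruction per cycle:

```python
# Illustrative sketch (not from the text): cycle counts for a k-stage
# instruction pipeline executing n instructions, assuming an ideal
# one-instruction-per-cycle issue rate and equal delay in all stages.

def pipelined_cycles(n_instructions: int, n_stages: int) -> int:
    """First instruction takes n_stages cycles; each later one adds 1."""
    return n_stages + (n_instructions - 1)

def unpipelined_cycles(n_instructions: int, n_stages: int) -> int:
    """Without overlap, every instruction takes all n_stages cycles."""
    return n_stages * n_instructions

n, k = 100, 4                        # fetch, decode, execute, write-back
pipe = pipelined_cycles(n, k)        # 4 + 99 = 103 cycles
serial = unpipelined_cycles(n, k)    # 400 cycles
print(pipe, serial, serial / pipe)   # speedup approaches k for large n
print(pipe / n)                      # effective CPI approaches 1
```

For large n the effective CPI tends to 1, which is exactly the base scalar processor of question 13; combining stages halves the clock, so the same cycle counts take twice the wall-clock time.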

15. Give some examples of co-processors for some processors and evaluate their performance metrics.

S.No | Main processor | Co-processor | Speed (MHz) | CPI for add | CPI for log
1 | 8086/8088 | Intel 8087 | 5 | 70 | 700
2 | | Intel 80287 | 12.5 | 30 | 264
3 | Intel 386DX | Intel 387DX | 33 | 12 | 210
4 | Intel | Intel i486 | 33 | 8 | 171
5 | MC68020/68030 | MC68882 | 40 | 56 | 574
6 | Intel 386DX | Weitek 3167 | 33 | 6 | 365

INSTRUCTION SET ARCHITECTURE:

16. Comment on Instruction Set Architecture.
The instruction set of a computer specifies the set of instructions provided to the user for programming the machine. The complexity of the instruction set is decided by its instruction format, data formats, addressing modes, general purpose registers, opcode specifications and flow control mechanisms. There are two schools of thought on the design of the instruction-set architecture of a computer system:
• Complex Instruction Set Computers (CISC)
• Reduced Instruction Set Computers (RISC)

17. Write short notes on the evolution of CISC.
In the early days of computing, instruction sets were kept simple because of the high cost of hardware. Hardware cost has since dropped, while complex high-level languages (HLLs) have evolved, widening the semantic gap between the HLL and the machine. To implement such complex HLL functions, more and more functions were built into the hardware, so instruction sets became very large and complex. The evolution and widespread use of microprogramming added to this trend, even making it possible to implement user-defined instructions in micro-code. These trends resulted in the evolution of CISC processors.

18. Mention the characteristics of CISC instruction sets.
i) CISC instruction sets may contain 120 to 350 instructions.
ii) They follow variable instruction/data formats.
iii) Most systems are built with 8 to 24 general purpose registers.
iv) More than a dozen addressing modes are available.
v) A large number of memory-reference operations are provided.
vi) They may implement HLL statements in hardware or firmware.
vii) They allow vector and symbol processing.

19. Write short notes on the evolution of RISC.
After two decades of computing with CISC processors, it was realized that only 25% of the instructions of a CISC instruction set were used about 95% of the time, while the others were rarely used. The rarely used instructions were elaborate, demanded long micro-codes to execute, and consumed valuable chip area. Hence it was decided to remove such instructions from the instruction set and implement them in software. As these instructions were rarely used, it mattered little that the software implementations were slower. Moreover, the chip space freed up could be used for building more powerful RISC/superscalar processors with on-chip caches and floating point units.

20. Mention the characteristics of RISC instruction sets.
i) RISC instruction sets contain fewer than 100 instructions.
ii) They follow a fixed instruction format (32 bits).
iii) Only 3 to 5 different addressing modes are available.
iv) Most of the instructions are register based.
v) Memory is accessed by LOAD and STORE instructions only.
vi) They follow hardwired instruction control.
vii) The time taken to execute most instructions is 1 cycle.
viii) A large number of registers is used for fast context switching.
ix) The entire processor is etched on a single VLSI chip.

21. What are the advantages of RISC over CISC?
i) Higher clock rate and lower CPI.
ii) Higher MIPS rates.
iii) Separate instruction and data caches with different access paths improve performance.

22. Give some examples for CISC computers.
Intel i486, M68040, VAX/8600, IBM 390

23. Give some examples for RISC computers.
SPARC, MIPS R3000, IBM RS/6000

24. Give a few examples for CISC scalar families.
Intel processors: (4-bit); 8086, 8088, 80186, 80286 (16-bit); 80386, 80486, 80586 (32-bit)
Motorola processors: MC6800 (8-bit), MC68000 (16-bit), MC68020, MC68030, MC68040 (32-bit)

25. Expand: SPARC.
Scalable Processor Architecture

26. What are the remarkable features of RISC scalar processors?
Less frequently used operations are handled by software. They rely on good compilers. They exploit instruction-level parallelism through pipelining.

27. What are the general features of RISC processors?
Most RISC processors have 32-bit instructions. Their instruction sets consist of 50 to 125 instructions. Special floating point units are provided either on-chip or off-chip. They have a large instruction cache, a data cache and a memory management unit, and support pipelining.

28. Tabulate the comparative features of RISC and CISC processor architectures.

Architectural characteristic | CISC | RISC
Instruction set size | Large (120 - 350) | Small (< 100)
Instruction format | Variable (16 - 64 bits per instruction) | Fixed (32 bits), register-based instructions
Addressing modes | 12 - 24 | 3 - 5
General purpose registers | 8 - 24 | 32 - 192
Cache design | Unified cache for data and instructions | Separate caches for data and instructions
Clock rate | 33 - 50 MHz | 50 - 150 MHz
Cycles per instruction (CPI) | 2 - 15 | < 1.5
CPU control | Micro-coded, using control memory | Hardwired, no control memory
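The clock-rate and CPI rows of the comparison above combine into a throughput estimate via MIPS = clock (MHz) / CPI. A quick sketch, with operating points picked from the table's ranges purely for illustration:

```python
# MIPS rate from clock rate and CPI: MIPS = f(MHz) / CPI.
# The sample operating points below are representative values from the
# CISC/RISC comparison table, not measured figures for any real chip.

def mips(clock_mhz: float, cpi: float) -> float:
    return clock_mhz / cpi

print(mips(50, 2))     # CISC at 50 MHz, CPI 2   -> 25.0 MIPS
print(mips(50, 1.5))   # RISC at 50 MHz, CPI 1.5 -> ~33.3 MIPS
```

At the same clock rate, the lower CPI alone gives the RISC design the higher MIPS rate; the higher clock rates RISC permits widen the gap further.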

29. Discuss the features of the DEC VAX 8600 CISC processor architecture.
• The DEC VAX 8600 was introduced by Digital Equipment Corporation in 1985.
• It uses a CISC architecture with micro-programmed control.
• The instruction set consists of 300 instructions with 20 different addressing modes.
• It interfaces with the SBI and Unibus.
• The CPU consists of two functional units for concurrent execution of integer and floating point instructions.
• A single 16 KB cache is used to store both instructions and data.
• 16 general purpose registers are available in the instruction unit.
• An instruction pipeline with six stages is used for both integer and floating point calculations. The instruction unit pre-fetches and decodes instructions, handles branching operations and supplies operands to the two functional units in a pipelined fashion.
• A translation lookaside buffer is used for translating a virtual address into a physical address.
• If the cache hit ratio is high and no branching exists, performance is very good. The CPI varies from 2 to 20 cycles; the clock rate is low and the CPI is high.
• A long sequence of micro-instructions is used to control hardware operations.
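The translation-lookaside-buffer step mentioned above works the same way in any paged machine: the TLB is a small cache consulted before the page table. A minimal sketch, assuming a 4 KB page size and a dictionary-backed page table (both illustrative choices, not the VAX 8600's actual parameters):

```python
# Hypothetical sketch of TLB-assisted virtual-to-physical translation.
PAGE_SIZE = 4096                   # assumed page size for round numbers

page_table = {0: 7, 1: 3, 2: 9}    # virtual page number -> physical frame
tlb = {}                           # small cache of recent translations

def translate(vaddr: int) -> int:
    vpn, offset = divmod(vaddr, PAGE_SIZE)
    if vpn in tlb:                 # TLB hit: no page-table walk needed
        frame = tlb[vpn]
    else:                          # TLB miss: consult the page table
        frame = page_table[vpn]
        tlb[vpn] = frame           # cache the translation for next time
    return frame * PAGE_SIZE + offset

print(translate(4100))   # vpn 1, offset 4 -> 3*4096 + 4 = 12292
print(tlb)               # {1: 3}
```

In hardware the TLB lookup is fully associative and takes a fraction of a cycle, which is why a high TLB hit ratio matters so much for the performance point made above.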

30. Describe the architectural features of the MC68040 CISC processor.
• The MC68040 is a 32-bit CISC microprocessor. It is packaged in a 0.8 µm HCMOS chip with 179 pins. It has 1.2 million transistors, a 20 MHz clock and a 40 MHz processor. It was released in 1990.
• It has an instruction set of 113 instructions.
• There are 16 general purpose registers. It also houses a 4 KB data cache and a 4 KB instruction cache, with 64 entries each. Memory management is supported by an address translation cache (translation lookaside buffer) which translates virtual addresses into physical addresses very quickly.
• It supports data formats ranging from 8 to 80 bits and follows the IEEE floating point standard.
• It supports 18 different addressing modes, e.g. register direct, register indirect, indexed, memory indirect, PC indirect, absolute and immediate.
• The instructions can be classified into the following types:
o Data movement instructions
o Integer, BCD and floating point arithmetic instructions
o Logical and bit field manipulation instructions
o Cache and memory management instructions
o Shift instructions
o Multiprocessor communication instructions
o Program and system control instructions

• The integer unit is organized as a 6-stage pipeline and the floating point unit as a 3-stage pipeline. It has an on-chip floating point unit with eight 80-bit floating point registers.
• There are separate 32-bit instruction and data buses and a dual memory management unit for interleaved fetching of instructions and data. It also has a 32-bit address bus and data bus. Three simultaneous memory requests are allowed (data read, data write and instruction read).
• It implements virtual demand-paged memory management with 4 or 8 KB pages. There are 64 entries each in the address translation cache and the instruction translation cache.
• It achieves a performance of 20 MIPS @ 25 MHz and 30 MIPS @ 60 MHz.

31. Mention the major features of the Intel i486 CISC scalar processor.
• The i486 is an HCMOS IV technology processor housed in a 168-pin package. It has a 25 MHz clock; processor speed is 33 MHz. It has 1.2 million transistors and was released in 1989.
• It is a 32-bit processor with 8 general purpose registers.
• There are 157 32-bit instructions in the instruction set, and it supports 12 different addressing modes.
• An 8 KB cache is used for storing both instructions and data.
• There is an on-chip floating point unit with 8 registers, an adder, a multiplier and a shifter.
• It has a 5-stage pipeline and provides 4 levels of protection.
• It supports segmented paged memory management with 4 KB pages, and the TLB can store 32 entries.
• It achieves a performance of 24 MIPS @ 25 MHz.

32. Outline the features of the NS32532 CISC microprocessor.
• The NS32532 is a 32-bit CMOS processor housed in a 1.25 µm, 175-pin package. It consists of 370 thousand transistors and has a 30 MHz clock. It was introduced in 1987.
• It has 63 32-bit instructions in its instruction set.
• It provides 9 different addressing modes and has 8 general purpose registers.
• It also has a 512-byte code cache and a 1 KB data cache.
• It has 4 pipeline stages and provides 2 levels of protection.
• It supports paged memory management with 4 KB pages, and the TLB can store 64 entries.
• It achieves a performance of 15 MIPS @ 30 MHz.

33. Describe the features of the Sun SPARC CY7C601 RISC microprocessor.
• It is a 0.8 µm CMOS IV chip with 307 pins and a 33 MHz clock. It was introduced in 1989.
• It has 69 32-bit instructions.
• It supports 7 different data types.
• It implements a 4-stage instruction pipeline and has a 32-bit instruction unit.
• It has 136 registers divided into 8 windows.
• It also has an off-chip cache-cum-memory-management unit, the CY7C604, with 64 entries in the TLB.
• There is also an off-chip floating point unit, the CY7C602, with 32 registers and a 64-bit pipeline.
• It can carry out instruction unit and floating point unit operations concurrently.
• It has a performance of 24 MIPS @ 33 MHz and 50 MIPS @ 80 MHz.

34. Enumerate the features of the Intel i860 RISC scalar processor.
• The i860 is a 32-bit RISC scalar processor introduced in 1989. It has a 40 MHz clock and is housed on a 1 µm CHMOS IV IC with 168 pins.
• It has 82 32-bit instructions and supports 4 addressing modes.
• It has 32 general purpose registers, a 4 KB code cache, an 8 KB data cache and an on-chip memory management unit which supports 4 KB pages.
• It also holds an on-chip 64-bit floating point adder/multiplier unit, 32 floating point registers and a 3D graphics unit.
• It is capable of executing two instructions and two floating point operations simultaneously.
• It gives a performance of 40 MIPS / 60 MFlops.

35. Outline the features of the Motorola M88100 RISC scalar processor.
• The M88100 is a 32-bit RISC scalar processor introduced in 1988. It has a 20 MHz clock and is housed on a 1 µm HCMOS chip with 180 pins. It has 1.2 million transistors.
• It supports 51 instructions and 7 different data types.
• There are three different instruction formats, and it supports 4 addressing modes.
• There are 32 general purpose registers, and it supports scoreboarding.
• There is also an off-chip cache and memory management unit (M88200) of 16 KB capacity.
• It supports segmented paging and memory access with delayed branches.
• It also houses an on-chip floating point adder/multiplier unit with 32 floating point registers implementing 64-bit arithmetic.
• The instruction unit and floating point unit can operate concurrently.
• It gives a performance of 17 MIPS and 6 MFlops @ 20 MHz.

36. Give an overview of the features of the AMD 29000 RISC scalar processor.
• The AMD 29000 is a 32-bit RISC scalar processor introduced in 1988. It is etched on a 1.2 µm CMOS chip with 169 pins. It has a 30 MHz clock and a 40 MHz processor.
• It supports 112 32-bit instructions and has 192 general purpose registers without windows. All registers support indirect addressing.
• It has an on-chip memory management unit, a 32-entry translation lookaside buffer, a 4-word prefetch buffer and a 512-byte branch target cache.
• A floating point unit is available off-chip (AMD 29027) or on-chip (AMD 29050).
• It uses a 4-stage pipeline.
• It gives a performance of 27 MIPS @ 40 MHz.

37. Discuss the characteristics and working of the Sun SPARC architecture.
SPARC uses a scalable processor architecture. Scalability here refers to the use of a different number of register windows in different implementations. Sun SPARC is derived from the original Berkeley RISC design. Consider the architecture of the Cypress CY7C601 SPARC processor and the CY7C602 floating point unit. The SPARC runs each procedure with a set of 32 32-bit instruction unit registers. Eight of these are global registers shared by all procedures; the remaining 24 are window registers associated with only one procedure. Windows are overlapped across procedures.
Each register window is divided into three eight-register sections: Ins, Locals and Outs. The local registers are addressable only by their own procedure; the Ins and Outs are shared between procedures. The calling procedure passes parameters to the called procedure via its Outs (r8 - r15); the same set of registers serves as the Ins of the called procedure. The window of the currently running procedure is called the active window, and is pointed to by the current window pointer. A window invalid mask indicates which windows are invalid. The advantage of register windows is that overlapping windows significantly reduce the time required for inter-procedure communication, resulting in much faster context switching among co-operating procedures.
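The overlap described above can be captured in a toy model. The 8-register global/in/local/out split follows the text, but the flat register-file layout and the method names below are illustrative assumptions:

```python
# Hypothetical sketch of SPARC-style overlapping register windows.
# Each call slides the window by 16 registers, so the caller's outs
# physically alias the callee's ins: parameters pass with no copying.

class RegisterWindows:
    def __init__(self, n_windows=8):
        self.globals = [0] * 8                 # shared by all procedures
        # Each window adds 16 new registers (locals + outs); its 8 ins
        # are the previous window's outs.
        self.file = [0] * (16 * n_windows + 8)
        self.cwp = 0                           # current window pointer

    def _base(self):
        return 16 * self.cwp

    def ins(self):     return self.file[self._base() : self._base() + 8]
    def locals(self):  return self.file[self._base() + 8 : self._base() + 16]
    def outs(self):    return self.file[self._base() + 16 : self._base() + 24]

    def write_out(self, i, value):             # caller places a parameter
        self.file[self._base() + 16 + i] = value

    def call(self):    self.cwp += 1           # like SPARC SAVE
    def ret(self):     self.cwp -= 1           # like SPARC RESTORE

rw = RegisterWindows()
rw.write_out(0, 42)      # caller writes its first out register
rw.call()                # slide the window: no data moves
print(rw.ins()[0])       # -> 42, visible as the callee's first in register
```

A real implementation wraps the window pointer around a fixed set of windows and traps to spill registers to memory when the window invalid mask marks the next window as in use; the sketch omits both.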

Three basic instruction formats are used, each 32 bits long. 32 single-precision and 16 double-precision floating point registers are available in the on-chip floating point unit. Fourteen of the 69 instructions in the instruction set are floating point operations. The trap base register serves as a pointer to a trap handler, and special registers are provided for computing 64-bit products. It gives a performance of 50 MIPS with an 80 MHz clock (BIT 70) and 200 MIPS with a 200 MHz clock (GaAs SPARC).

38. What are the constraints faced in converting CISC programs to run on RISC processors?
i. Converting a CISC program to an equivalent RISC program increases the code length by about 40% (depending on program behaviour).
ii. Increase in clock rate and reduction in CPI.
iii. More instruction traffic and greater memory demand.
iv. The register decoding system is more complicated, although more registers are available.
v. Longer register access times place a greater demand on the compiler to manage register window functions.
vi. Hardwired control is less flexible and more error-prone.

39. What are the future trends of RISC technology?
a. Optimal sizes of the register set, I-cache and D-cache can be determined based on the applications.
b. 64-bit integer ALUs, multiprocessor support, faster inter-processor synchronization, hardware support for message passing, and special functional units for I/O and graphics can be incorporated.

40. What is the current trend in processor technology?
Merging of positive features from different processor families. E.g. the VAX 9000, Motorola 88100 and i586 are built with mixed features from RISC and CISC.

41. Outline the features of the Intel i860 RISC scalar processor. (Refer to figure on next page)
There are nine functional units: instruction cache, data cache, memory management unit, bus control unit, RISC integer unit, floating point control unit, graphics unit, pipelined adder unit and pipelined multiplier unit.

i. The address buses are 32 bits wide and the data bus is 64 bits wide.
ii. The RISC integer unit is a 32-bit ALU that executes load, store, integer, bit and control instructions. It also fetches instructions for the floating point control unit.
iii. The instruction cache is a set-associative memory with 32 bytes per cache block. Its total size is 4 KB. It transfers 64 bits per clock cycle, i.e. 320 MB/s @ 40 MHz.
iv. The data cache is a two-way set-associative memory of 8 KB. It transfers 128 bits per clock cycle, i.e. 640 MB/s @ 40 MHz. A write-back policy is used. Caching can also be initiated by software.
v. The bus control unit co-ordinates 64-bit data transfers in and out of the processor.
vi. The memory management unit implements paged virtual memory of 2^32 bytes via a TLB, with 4 KB pages. The i860 and i486 can be used jointly in a heterogeneous multiprocessor system for the development of compatible OS kernels.
vii. The multiplier and adder units can be used simultaneously or separately. Special dual-operation floating point instructions, add-and-multiply and subtract-and-multiply, are available. Integer and floating point instructions can be executed concurrently.
viii. The graphics unit executes integer operations on 8-, 16- or 32-bit pixel data types. It supports 3D drawing in a frame buffer with colour intensity, shading and hidden surface elimination. A merge register accumulates the results of multiple operations.
ix. The instruction set consists of 82 instructions, of which 42 are RISC integer, 24 are floating point, 10 are graphics and 6 are assembler pseudo-operations.
x. The system gives a peak performance of 80 MFlops with single-precision arithmetic, 60 MFlops with double-precision arithmetic, and 40 MIPS with 32-bit integer operations @ 40 MHz.

SUPERSCALAR AND VECTOR PROCESSORS:

42. What is a scalar processor?
A processor that executes one instruction per cycle, with only one completed instruction emerging from the pipeline per cycle, is called a scalar processor.

43. What is a superscalar processor?
A processor in which multiple instruction pipelines are used, so that multiple instructions are issued per cycle and multiple results are generated per cycle, is called a superscalar processor.

44. What is a vector processor?
A processor that executes vector instructions on arrays of data, i.e. where each instruction involves a string of repeated operations, is called a vector processor. It is a good candidate for pipelining and produces one result per cycle.
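The "string of repeated operations" behind a vector instruction (question 44) can be mimicked with Python lists standing in for vector registers; this is an illustrative model, not any real machine's semantics:

```python
# Toy model: one vector instruction = one operator applied across
# whole operand arrays (in hardware, one element result per cycle).
from operator import add, mul

def vop(op, v1, v2):
    """A binary vector instruction: element-wise op over two vectors."""
    return [op(a, b) for a, b in zip(v1, v2)]

def vreduce(op, v1, v2):
    """A vector reduction, e.g. a dot product, yielding one scalar."""
    return sum(op(a, b) for a, b in zip(v1, v2))

V1, V2 = [1, 2, 3, 4], [10, 20, 30, 40]
print(vop(add, V1, V2))      # [11, 22, 33, 44]
print(vreduce(mul, V1, V2))  # 300, i.e. 1*10 + 2*20 + 3*30 + 4*40
```

The point of the hardware is that the single instruction replaces an entire scalar loop, eliminating its fetch/decode and loop-control overhead.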

45. Explain the characteristic features of superscalar processors.
i. Superscalar processors are designed to exploit more instruction-level parallelism in user programs. Only independent instructions can be executed in parallel without causing wait states. Data dependencies and resource conflicts may hamper parallelism, so the amount of parallelism that can be exploited depends on the type of code. On average, a program without loop unrolling is observed to exhibit a parallelism of about 2 instructions; therefore the instruction issue degree has in practice been limited to 2 to 5 for a general processor.
ii. Consider a superscalar processor pipeline with 3 instruction pipelines in parallel, i.e. a triple-issue processor. A superscalar processor that can issue m instructions per cycle is said to have degree m. If m instructions are executed in parallel in every cycle, the processor is fully utilized; however, this is not feasible in all clock cycles, so some of the pipelines may be in a wait state. An operation latency of 1 cycle is possible only in the ideal case. To achieve a higher degree of parallelism, an optimizing compiler is used.

46. When can a superscalar processor achieve the same performance as a vector processor?
Superscalar processors were originally developed as an alternative to vector processors. A superscalar machine that can issue a fixed point instruction, a floating point instruction, a load and a branch all in one cycle achieves the same effective parallelism as a vector processor that executes a vector load chained into a vector add, with one element loaded and added per cycle.
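A toy model of degree-m in-order issue shows why data dependence keeps the achieved rate below m. Everything here, the instruction encoding, the single-cycle latency, the in-order stall rule, is an illustrative simplification:

```python
# Sketch of degree-m in-order superscalar issue. Instructions are
# (dest, {sources}); each cycle issues up to m instructions, but an
# instruction whose source is written earlier in the same cycle must
# wait for the next cycle (1-cycle latencies assumed).

def issue_cycles(program, m):
    cycles, i = 0, 0
    while i < len(program):
        issued_dests = set()
        slots = 0
        while i < len(program) and slots < m:
            dest, srcs = program[i]
            if srcs & issued_dests:    # depends on this cycle's results
                break                   # in-order issue: stall here
            issued_dests.add(dest)
            slots += 1
            i += 1
        cycles += 1
    return cycles

independent = [("r1", set()), ("r2", set()), ("r3", set()), ("r4", set())]
dependent   = [("r1", set()), ("r2", {"r1"}), ("r3", {"r2"}), ("r4", {"r3"})]
print(issue_cycles(independent, 2))  # 2 cycles -> CPI 0.5 (full degree-2 use)
print(issue_cycles(dependent, 2))    # 4 cycles -> CPI 1.0 (chain serializes)
```

The dependent chain gets no benefit from the second pipeline at all, which is exactly why issue degrees beyond 2 to 5 pay off only with compiler help such as loop unrolling.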

47. Describe the features of a typical superscalar architecture for a RISC processor.
• Multiple instruction pipelines are used.
• The instruction cache supplies multiple instructions per fetch.
• However, the actual number of instructions issued to the various functional units may vary in each cycle, depending on the amount of data dependence and resource conflict among the simultaneously decoded instructions.
• Multiple functional units are built into the integer unit and the floating point unit.
• Multiple data buses exist among the functional units.
• Due to reduced CPI and higher clock rates, superscalar processors outperform scalar processors.
• The maximum issue rate is typically 2 to 5 instructions per cycle.
• Up to 32 registers are found in each register file in the IU and FPU.

48. Name some representative superscalar processors.
IBM RS/6000, DEC 21064, Intel i960CA

49. How can the amount of parallelism be improved in superscalar processors?
Reservation stations and re-order buffers can be used to establish instruction windows. These provide support for instruction look-ahead and internal data forwarding.

50. Explain the features of the IBM RS/6000 with a block diagram.

• The RISC System/6000 was announced by IBM in early 1990.
• It is a superscalar processor built with 1 µm CMOS technology and has a 30 MHz clock.
• It gives a performance of 34 MIPS and 11 MFlops @ 25 MHz on the POWERstation 530.
• It contains 32 32-bit general purpose registers, an 8 KB I-cache and a 64 KB D-cache with separate TLBs.
• It has three functional units, called the branch processor, the fixed point unit and the floating point unit, which can operate in parallel.
• The branch processor can arrange the execution of up to five instructions per cycle: 1 branch instruction in the branch processor, 1 fixed point instruction in the fixed point unit, 1 condition register instruction in the branch processor, and 1 floating point multiply-add instruction in the floating point unit (counted as 2 instructions).
• Its on-chip floating point unit can perform arithmetic, including addition and subtraction, on 64-bit floating point numbers as per the IEEE 754 standard.
• It is hardwired and uses a number of wide buses: 32 bits for the fixed point unit, 64 bits for the floating point unit, and 128 bits for the I-cache and D-cache, so high instruction and data bandwidths are available.
• It is optimized to perform well in number-crunching (scientific and engineering) applications and in multi-user commercial applications.
• A number of workstations and servers have been built using the RS/6000, e.g. the POWERstation 530.

51. Outline the features of the Intel i960CA superscalar processor.
• It was released in 1986.
• Performance specifications: 25 MHz clock, 30 VAX/MIPS peak at 25 MHz.
• It is used in real-time, embedded, system control and multiprocessor applications.
• It issues up to 3 instructions per cycle (register, memory and control).
• 7 functional units are available for concurrent use.
• It has a 1 KB I-cache, 1.5 KB RAM, 4-channel I/O with DMA, parallel instruction decode, and multi-ported registers.
• There is an on-chip floating point unit.
• Fast multi-mode interrupts and multitask control are also supported.

52.
Outline the characteristic features of the DEC 21064 superscalar processor.
• The DEC 21064 was introduced in 1992. It is a 0.75 µm CMOS processor with 431 pins, a 150 MHz clock, 300 MIPS peak and 150 MFlops peak performance.
• It supports multiprocessor systems with cache coherence and uses the Alpha architecture.
• It issues 2 instructions per cycle.
• It has a 64-bit instruction unit and floating point unit, a 128-bit data bus and a 34-bit address bus.
• There are 32 64-bit general purpose registers and 8 KB each of I-cache and D-cache; it supports a 64-bit virtual space and a 43-bit address space.
• There is an on-chip floating point unit with 32 64-bit floating point registers.
• It uses a 10-stage pipeline and follows the IEEE and VAX floating point standards for the representation of floating point numbers.

THE VLIW ARCHITECTURE:

53. Mention two significant features found in the VLIW architecture.
Horizontal micro-coding and superscalar processing, i.e. multiple functional units working concurrently and sharing large register files.

54. Why are VLIW processors called so?
The instruction word of a VLIW processor is hundreds of bits long, typically 256 to 1024; hence the name.

55. Expand VLIW.
Very Long Instruction Word.

56. How is VLIW implemented?
The VLIW concept builds on horizontal micro-coding. Different fields of the long instruction word carry opcodes to be dispatched to different functional units, and the operations to be executed simultaneously by the functional units are synchronized within a VLIW instruction. Multiple functional units work concurrently and share the use of a common large register file. Programs written in conventional short instruction words of 32 bits are compacted together by a compiler that can predict branch outcomes using elaborate heuristics or run-time statistics.

57. What are the drawbacks / limitations of the VLIW architecture?
• Performance of a VLIW machine depends on the efficiency of code compaction.
• Code density is better only when more instruction-level parallelism exists.
• The architecture is totally incompatible with other processor types, so programs are not portable.
• Different functional units have different latencies, so different implementations of a VLIW architecture may not be binary compatible.

58. What are the advantages of a VLIW processor?
• Run-time resource scheduling and synchronization are eliminated (waits are avoided).
• Decoding of instructions is easy.
• A low CPI may be achieved (depending on the degree of parallelism that exists among instructions).
• Hardware and software to detect parallelism at run time are eliminated.
• Simple hardware structure and instruction set.
• It performs well in applications with predictable program behaviour (branch prediction), e.g. scientific applications.

59.
What is the difference between VLIW and vector processors?
VLIW processors exploit random parallelism among scalar operations, whereas vectorized supercomputers and SIMD computers exploit regular (synchronous) parallelism.
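The compile-time packing behind VLIW (questions 56-58) can be sketched as a greedy packer that fills one slot per functional unit and pads unused slots with no-ops. The slot names and the unit-conflict-only packing rule are illustrative simplifications; a real compacting compiler must also respect data dependences and latencies:

```python
# Hypothetical sketch of compile-time VLIW packing: each long word has
# one slot per functional unit; unused slots are padded with no-ops,
# which is the wasted space noted among the VLIW drawbacks.

def pack_vliw(ops, slots=("int", "fp", "mem", "branch")):
    """ops: list of (unit, name) in program order. A new long word is
    started whenever the next op's functional-unit slot is taken."""
    words, current = [], {}
    for unit, name in ops:
        if unit in current:          # slot conflict: emit current word
            words.append(current)
            current = {}
        current[unit] = name
    if current:
        words.append(current)
    return [[w.get(s, "nop") for s in slots] for w in words]

ops = [("int", "add"), ("mem", "load"), ("int", "sub"), ("fp", "fmul")]
for word in pack_vliw(ops):
    print(word)
# ['add', 'nop', 'load', 'nop']
# ['sub', 'fmul', 'nop', 'nop']
```

Note that half the slots here are no-ops, illustrating why VLIW code density is good only when the program offers enough instruction-level parallelism.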

60. Compare a VLIW processor with a superscalar processor.

S.No | VLIW processor | Superscalar processor
1 | Decoding of instructions is easy | Decoding of instructions is relatively difficult
2 | Code density is better only when more instruction-level parallelism exists | Code density is better when the available instruction-level parallelism is less than that exploitable by the machine
3 | If parallelism is lacking, non-executable (no-op) operations are inserted into the VLIW word, which is wasteful | Issues only executable instructions, so no wastage occurs
4 | VLIW machines exploiting different amounts of parallelism need different instruction sets | Object-code compatible with other non-parallel machines
5 | Instruction parallelism and data movement are specified at compile time | Instruction parallelism and data movement are analyzed and exploited at run time
6 | Run-time resource scheduling and synchronization are eliminated (waits are avoided) | Dynamic resource scheduling leads to wait times
7 | CPI can be lower (program dependent), e.g. the Multiflow machine executes 7 operations per long instruction | CPI may be higher due to wait times
8 | Suits only programs with predictable branching behaviour | Suits all applications

VECTOR AND SYMBOLIC PROCESSORS:

61. What is a vector processor?
A vector processor is a co-processor specially designed to perform vector computations. Vector instructions involve large arrays of operands.

62. Where is a vector processor used?
Vector processors are used in multi-pipelined supercomputers.

63. What are the two types of architectures used in a vector processor?
Register-to-register architecture and memory-to-memory architecture.

64. What is a register-to-register architecture?
An architecture that uses short instructions and vector register files.

65. What is a memory-to-memory architecture?
An architecture that uses longer, memory-based instructions which address memory locations directly instead of vector registers.

66. Write a note on register-based vector instructions.
Register-based vector instructions appear in register-to-register vector processors such as the Cray supercomputers. Denote a vector register of length n as Vi, a scalar register as si, and a memory array of length n as M(1:n). For vectors of the same length, vector operations (denoted by an operator ∘) are defined as follows:

V1 ∘ V2 → V3 (binary vector)
s1 ∘ V1 → V2 (scaling)
V1 ∘ V2 → s1 (binary reduction)
M(1:n) → V1 (vector load)
V1 → M(1:n) (vector store)
∘ V1 → V2 (unary vector)
∘ V1 → s1 (unary reduction)

Vector operations are performed by dedicated pipeline units, including functional pipelines and memory-access pipelines. Longer vectors are segmented to fit in the vector registers, n elements at a time.

67. Write a note on memory-based vector instructions.
These are carried out by dedicated pipeline units containing functional pipelines and memory-access pipelines. They are found in memory-to-memory vector processors such as the Cyber 205. Examples of memory-based vector operations are:

M1(1:n) ∘ M2(1:n) → M3(1:n)
s1 ∘ M1(1:n) → M2(1:n)
∘ M1(1:n) → M2(1:n)
M1(1:n) ∘ M2(1:n) → M(k)

where M(1:n) denotes a vector of length n and M(k) denotes a scalar. Long vectors are streamed using superwords.

68.
Compare Vector processors and SIMD computers. In vector processors, we use efficient pipelining for vector processing. In SIMD computers, we use spatial or data parallelism for vector processing. 69. What distinguishes symbolic processing from other types of processing? Symbolic processing is used where we meet complexity in data and knowledge representation, primitive operations, algorithmic behaviour, memory storage, I/O and communications, and other architectural differences. It deals with logic programs, symbol lists, objects, scripts, bulletin boards, production systems, semantic networks, frames and artificial neural networks. 70. What are the other names by which symbolic processors are denoted? Prolog processors, Lisp processors and symbolic manipulators. 71. Where are symbolic processors applied? Theorem proving, pattern recognition, expert systems, knowledge engineering, text retrieval, cognitive science and machine intelligence. 72. What are the characteristics of Symbolic processing? (Refer to the table on page 186 of the textbook.) 73. Give an outline of Vector pipelines in Vector processors.
- Vector pipelines can be attached to any scalar processor, superscalar processor or super-pipelined processor.
- Vector processors take advantage of unrolled loop-level parallelism.
- An optimizing compiler vectorizes sequential code before execution on a vector pipeline.
- Vectorization eliminates the software overhead of loop control.
- Each vector instruction executes a string of operations, one for each element in the vector.
74. How does a Lisp program implement parallelism? A Lisp program is a set of functions in which data are passed from function to function. The concurrent execution of these functions forms the basis for parallelism. Lisp needs efficient stack-based computing for recursive function calls. Linked lists are used to store data dynamically. Automatic garbage collection is used for efficient data handling. 75. 
What are the primitive operations for Artificial Intelligence Applications? Search, compare, logical inferencing, pattern matching, unification, filtering, context search and retrieval, set operations, transitive closure and reasoning. 76. Describe the architecture of the Symbolics 3600 Lisp processor. It uses a stack-oriented architecture with two layers. The first layer uses a pure stack model to simplify instruction-set design. The second layer uses the stack for the implementation of operations. The stack buffer and scratch-pad memories are implemented as fast caches to main memory. It executes Lisp instructions in one machine cycle. Operations that can be executed in parallel are floating-point addition, fixed-point addition, data type checking and garbage collection.

MEMORY HIERARCHY TECHNOLOGY

77. What are the components of a memory hierarchy? Storage devices such as registers in the CPU, caches (SRAMs), main memory (DRAMs), disk storage (solid state, magnetic) and tape units (magnetic tapes, optical disks). (Figure: the memory hierarchy pyramid, in which capacity and access time increase, while speed and cost per byte decrease, moving down from registers to tape units.) 78. What are the characteristics/factors that determine the level of memory technology and storage organization in the memory hierarchy? Access time, memory size, cost per byte, transfer bandwidth, and unit of transfer. 79. Define access time.

The access time ti is the round-trip time from the CPU to the ith-level memory. 80. Define memory size.

The memory size si is the number of bytes or words in level i. 81. What is the cost of the ith level memory?

The cost of the ith-level memory is the product ci·si, where ci is the cost per byte of level i. 82. Define bandwidth of memory hierarchy.

The bandwidth bi refers to the rate at which information is transferred between adjacent levels. 83. Define grain size.

The unit of transfer xi refers to the grain size for data transfer between levels i and i+1. 84. What are the general characteristics of the memory devices in the memory hierarchy? Memory devices at a lower level are faster to access, smaller in size and more expensive per byte, having a higher bandwidth and using a smaller unit of transfer as compared with those at a higher level.
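The level parameters just defined (ti, si, ci, bi, xi) can be gathered into a small record per level; for instance, the time to move one grain between adjacent levels is roughly xi / bi. The numbers below are illustrative assumptions, loosely in the ranges quoted later in the text.

```python
# Sketch: per-level parameters of a memory hierarchy and a derived quantity.

levels = {
    # name: (access_time_s, size_bytes, cost_cents_per_KB, bandwidth_MBps, grain_bytes)
    "cache": (30e-9, 128 * 1024, 72.0, 300.0, 32),
    "main":  (80e-9, 512 * 1024**2, 5.6, 100.0, 4096),
}

def grain_transfer_time(level):
    """Approximate time to move one unit of transfer x_i at bandwidth b_i."""
    _, _, _, bw_MBps, grain = levels[level]
    return grain / (bw_MBps * 1e6)   # seconds

print(grain_transfer_time("main"))   # ~4.1e-05 s to move one 4 KB page
```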

85. Tabulate the values of different memory traits for different memory levels.

Characteristic         Registers            Cache               Main memory        Disks                    Tape units
Device technology      ECL                  256K-bit SRAM       4M-bit DRAM        1-GB magnetic disk unit  5-GB magnetic tape unit
Access time            10 ns                25-40 ns            60-100 ns          12-20 ms                 2-20 min (search time)
Capacity (bytes)       512 B                128 KB              512 MB             60-228 GB                512 GB - 2 TB
Cost (cents/KB)        18,000               72                  5.6                0.23                     0.01
Bandwidth (MB/s)       400-800              250-400             80-133             3-5                      0.18-0.23
Unit of transfer       4-8 bytes per word   32 bytes per block  0.5-1 KB per page  5-512 KB per file        Backup storage
Allocation management  Compiler assignment  Hardware control    Operating system   OS / user                OS / user

86. Which is the fastest type of memory? What are its characteristics? The registers and the cache are the fastest types of memory. They are found on the processor. The cache may be built on the processor chip or on the board. Register operations are carried out under the control of the processor, at processor speed, i.e., one clock cycle. Cache operations are controlled by the memory management unit and are transparent to the programmer. The cache may be implemented at multiple levels based on speed and application requirements. 87. Describe the characteristic features of different levels of memory in the memory hierarchy. Cache Memory: Cache memory is a very fast memory next to the registers. It is used to store frequently used data. It is controlled by the memory management unit and may be of several levels. Main Memory: Main memory is also called the primary memory. It is larger than cache memory. It is implemented using DRAM chips. It is managed by the MMU and the OS. It may be extended by adding more memory boards, and it may be divided into two sublevels using different memory technologies. Disk Drives and Tape Units: These are handled by the OS with limited user intervention. Disks hold system programs, user programs and their data sets. They are the highest level of on-line memory. Magnetic tapes are off-line memory devices used to back up current and past user programs. They also store processed results and files. Peripheral Technology: Peripheral devices such as printers, plotters, terminals, monitors, graphics displays, optical scanners, image digitizers and output microfilm devices are used by special-purpose multimedia applications. They are handled by special device-driver programs and are used to input or output special data types such as image, speech, video and sonar data. 88. 
What are the three important properties satisfied by information stored in a memory hierarchy? Inclusion, Coherence and Locality 89. Which property of memory hierarchy aids in effective utilization of memory? Locality property 90. What forms the virtual address space of the computer?

The collection of all addressable words in Mn forms the virtual address space of the computer. 91. Explain inclusion property of memory hierarchy.

The inclusion property is stated as M1 ⊆ M2 ⊆ M3 ⊆ … ⊆ Mn. That is, all information items found at one level of memory are also found at the next higher level of memory. All programs and data are stored in Mn. During processing, subsets of Mn are copied into Mn-1; similarly, subsets of Mi are copied into Mi-1. Hence the inclusion property holds at any time. When a data item is not found (a miss) at level i, it will not be found at any level below i.
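The inclusion property can be illustrated with a small Python sketch, modeling each memory level as a set of information items (the three-level contents are assumed for illustration):

```python
# Inclusion property M1 ⊆ M2 ⊆ ... ⊆ Mn, with levels modeled as sets.
# Index 0 is the innermost (cache-like) level; the last level holds everything.

levels = [
    {"a"},                   # M1
    {"a", "b"},              # M2
    {"a", "b", "c", "d"},    # M3 (outermost: all programs and data)
]

# Inclusion holds when every level is a subset of the next higher level.
inclusion_holds = all(levels[i] <= levels[i + 1] for i in range(len(levels) - 1))
print(inclusion_holds)   # True

# By inclusion, an item missing at level i is missing at every level below i,
# so the first hit scanning outward locates the item.
first_hit = next(i for i, m in enumerate(levels) if "c" in m)
print(first_hit)         # 2: misses in M1 and M2, hit in M3
```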

92. How is information at various memory levels organized? Information in the CPU and cache is handled and organized in terms of words (4 to 8 bytes). The cache may be divided into blocks of 32 bytes. The main memory is organized in terms of pages of 4 KB each. Pages are the units of information transferred between disk and main memory. Scattered pages are organized as segments; the size of a segment varies based on the user's needs. Information on disks and tapes is organized as files. 93. What is memory coherence? In order to minimize the effective access time of the memory hierarchy and, at the same time, optimize the utilization of memory resources, we store the entire code and data belonging to a single job on secondary storage such as disk or tape. Portions of code and data are then brought from disk into main memory as and when required. Similarly, frequently accessed data is kept in the cache and currently used data in registers. Hence, when the value of a variable/field is modified during execution, the modification has to be propagated back to all the copies of that variable/field stored at different levels of the memory hierarchy, especially on secondary storage. Only then are all the copies consistently the same, so that there is no conflict or ambiguity when the value is used again in the future. This property of consistency of all copies of the same information item at successive memory levels is called COHERENCE.

94. How is memory coherence maintained/implemented ? There are two techniques to ensure memory coherence:

- Write-through: in this method, whenever a word is modified at memory level Mi, it is immediately updated at level Mi+1, for i = 1 to n-1.
- Write-back: the update of the word at level Mi is carried out at level Mi+1 only when the word is about to be removed from Mi.
95. What is Locality of Reference? It has been found that a typical program may spend 90% of its execution time on only 10% of the code, e.g., inner loops. Hence memory references tend to access the same code/data, or code/data adjacent to a particular region, frequently. This is called locality of reference. 96. What are the three types of locality of reference? Explain. There are three dimensions to the locality property, based on page reference patterns: temporal, spatial and sequential. Temporal Locality: Recently referenced code/data are likely to be referenced again in the near future (e.g., in iterative loops, process stacks and temporary variables, a code segment is referenced again and again). This is called temporal locality. Spatial Locality: When we operate on tables/arrays we tend to process the same set of adjacent values repeatedly. Array values are stored consecutively in memory. Program instructions are also stored in consecutive memory locations; when they are executed in sequence, we refer to successive memory locations. Thus addresses near one another are accessed repeatedly. This is called spatial locality. Sequential Locality: As program instructions are mostly executed in sequence, successive instructions on a page are accessed one after another. Branching, which is relatively rare, causes instructions from a different location on the same page or on a different page to be accessed from memory; this happens relatively infrequently in ordinary programs. Array elements are also accessed in sequential order most of the time, so successive memory words are accessed. This phenomenon of accessing sequential memory locations is called sequential locality. 97. What are the effects of Locality of Reference? Due to such locality properties, the processor tends to access the same page or block most of the time. 
This is the basis for the use of a memory hierarchy, as the required pages and blocks alone can be stored for program execution. Only when the requested data/code is not available in the current page/block does a new page/block containing the required data/code have to be loaded into main memory. Due to locality of reference this happens infrequently. Each locality property is used to determine different traits of the memory hierarchy. For example, temporal locality forms the basis for the Least Recently Used page replacement policy; it is also used to determine the size of memory at successive levels. Spatial locality is used to determine the grain size, i.e., the size of unit data transfers between adjacent memory levels. Sequential locality is used to determine the grain size for optimal scheduling. All three determine the amount of data/code to be prefetched, the design of the cache memory, memory management policies and virtual memory organization. 98. What are working sets? The subset of addresses/pages referenced within a given time period is called the working set. 99. What is the significance of working sets? During program execution, the working set changes slowly and is mostly continuous due to locality of reference. The working set is accumulated at the lowest level (e.g., the cache), which reduces the effective memory access time and gives a higher hit ratio. The time window of the working set is set by the OS and greatly influences the size of the working set. Based on the working-set pattern, the optimal cache size is determined.
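The working-set idea can be made concrete with a short Python sketch: given a page-reference trace, the working set W(t, T) is the set of distinct pages referenced among the last T references ending at time t. The trace below is invented for illustration.

```python
# Working set W(t, T): distinct pages referenced in the window (t-T, t].

def working_set(trace, t, window):
    """Pages referenced among the last `window` references ending at time t."""
    start = max(0, t - window)
    return set(trace[start:t])

# A trace with locality: the program dwells on a few pages, then migrates.
trace = [1, 2, 1, 1, 2, 3, 3, 3, 2, 7, 7, 8, 7, 8, 8]
print(working_set(trace, 9, 5))    # {2, 3}: pages used in references 4..8
print(working_set(trace, 15, 5))   # {7, 8}: the program has moved on
```

Because of locality, the working set stays small relative to the window and changes slowly, which is why a cache sized to hold it yields a high hit ratio.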

MEMORY CAPACITY PLANNING: 100.How do you assess the performance of memory? The performance of a memory hierarchy is determined by the effective access time

Teff at any level in the hierarchy. This depends on the hit ratio and access frequencies at successive levels. 101.What is hit ratio? Hit ratio is a concept defined for any two adjacent levels of memory hierarchy. The

hit ratio hi at Mi is the probability that an information item will be found in Mi. 102. What is miss ratio? Consider memory levels Mi and Mi-1 in a hierarchy, i = 1, 2, …, n. The miss ratio at Mi is defined as 1 - hi, where hi is the hit ratio at Mi. 103. What are hit and miss?

When an information item is found in Mi we call it a hit. Otherwise it is a miss. 104.What are the various factors that determine the hit ratios at successive levels of the memory hierarchy? Memory capacities at each level, memory management policies and program behavior determine the hit ratios at successive levels of the memory hierarchy. 105.What is the value taken by the hit ratio? Hit ratios at each level are independent random variables with values between 0

and 1. We assume h0=0 and hn=1, which means that CPU always accesses M1 first and access to the outermost memory is always a hit. 106.Define access frequency.

The access frequency to Mi is defined as fi = (1-h1)(1-h2)…(1-hi-1)hi. This means that there were i-1 misses at the lower levels and a hit at Mi. Also note that Σ(i=1 to n) fi = 1. 107. How do locality of reference and access frequency relate to each other? Due to locality of reference, we are able to find the required code/data in the cache itself, or in main memory, almost 90% of the time. Hence the hit ratio h1 is close to 1 (at least above 0.75) most of the time. The access frequencies decrease very rapidly as we move down the hierarchy, i.e., f1 » f2 » f3 » … » fn. 108. Write a note on Effective Access Time of the memory hierarchy. Effective access time is the factor that indicates the performance of the memory hierarchy. The objective of building a memory hierarchy is to achieve as high a hit ratio

as possible at M1. Every time a miss occurs, the next level is accessed for the required data. A miss at the cache level is called a block miss, and a miss at the main memory level is called a page fault. The higher the level, the greater the latency of access.

Using the access frequencies fi for i = 1 to n, we can formally define the effective access time of a memory hierarchy as follows:

Teff = Σ(i=1 to n) fi·ti = h1t1 + (1-h1)h2t2 + (1-h1)(1-h2)h3t3 + … + (1-h1)(1-h2)…(1-hn-1)tn

The first several terms dominate. However Teff also depends on the program behavior and memory design choices. Only after extensive program trace studies can one

estimate the hit ratios and the value of Teff more accurately. 109.Write short notes on optimization of the memory hierarchy. The total cost of a memory hierarchy is estimated as follows:

Ctotal = Σ(i=1 to n) ci·si. Since c1 > c2 > c3 > … > cn, we have to choose s1 < s2 < s3 < … < sn. The optimal design of

a memory hierarchy should result in a Teff close to the t1 of M1 and a total cost close to

the cn of Mn. In reality this is difficult to achieve due to the tradeoffs among the n levels. Therefore the optimization problem can be stated as:

minimize Teff = Σ(i=1 to n) fi·ti, subject to the constraints si > 0, ti > 0 for i = 1, 2, …, n and Ctotal = Σ(i=1 to n) ci·si < C0. The unit cost and capacity at each level depend on the speed required. Therefore, the optimization involves tradeoffs among ti, si, ci and fi (or hi) at all levels.
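The formulas for fi, Teff and Ctotal above can be checked numerically with a short Python sketch. The three-level hit ratios, access times, costs and sizes below are invented for illustration.

```python
# Effective access time and total cost of a memory hierarchy.
# hit_ratios[i] = hi (with hn = 1), times[i] = ti in seconds.

def access_frequencies(hit_ratios):
    """fi = (1-h1)(1-h2)...(1-h_{i-1}) * hi for each level."""
    freqs, miss_so_far = [], 1.0
    for h in hit_ratios:
        freqs.append(miss_so_far * h)
        miss_so_far *= (1.0 - h)
    return freqs

def t_eff(hit_ratios, times):
    """Teff = sum of fi * ti over all levels."""
    return sum(f * t for f, t in zip(access_frequencies(hit_ratios), times))

def c_total(costs_per_byte, sizes):
    """Ctotal = sum of ci * si over all levels."""
    return sum(c * s for c, s in zip(costs_per_byte, sizes))

h = [0.95, 0.99, 1.0]        # cache, main memory, disk; hn = 1 by assumption
t = [20e-9, 100e-9, 10e-3]   # illustrative access times in seconds
print(sum(access_frequencies(h)))   # 1.0: the fi always sum to one
print(t_eff(h, t))                  # ~5.02e-06 s, dominated by rare disk accesses
```

Note how the rare accesses to the slowest level still dominate Teff, which is why the first several (fast) terms must be made as large as possible via high h1.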

VIRTUAL MEMORY TECHNOLOGY:

110. Explain the concept of Virtual memory and justify the need for it. The main memory of the computer is limited in capacity. It limits the maximum size of a user's program, as main memory has to hold the entire program during execution. This is one limitation that was faced in the early days of computing. When the resources of a computer were allocated to only one user at a time, programs had to be executed in batches. This led to long wait times and under-utilization of resources. Moreover, users preferred to run more than one program at a time. This led to the concept of multiprogramming and time-shared operating systems. In such a multiprogrammed environment, more than one program is loaded into memory. To improve performance, programs were divided into segments or pages, and only the currently executing pages or segments were loaded into main memory. Thereby, pages belonging to many user programs could be stored at a time, for multiprogrammed execution. When an unloaded page is required for execution, it can be brought from secondary storage and loaded into the space occupied by a relatively inactive page. This is called page replacement. Loading of required pages into memory, and replacement of inactive pages by active ones, are carried out efficiently by the operating system. This technique allowed user programs to be of any size. Also, users were given the impression that the main memory was very large, large enough to hold any number of programs of any size. This large memory is called the virtual memory. 111. What do you mean by address space? What are the types of address space? Explain each. The address space comprises the total number of words that can be individually addressed in memory. There are two types of address spaces: physical and virtual. Each word in the physical memory is identified by a unique physical address. All memory words in the main memory form a physical address space. 
Under the virtual memory management technique, virtual addresses are generated at compile time (we assume that addresses start at 0 and extend up to as many words as we need for storing code and data). The virtual addresses must be translated into physical addresses at run time (the code and data will be loaded into some portion of physical memory, so we have to convert each virtual address into the physical address that holds the code or data). A system of translation tables and mapping functions is used in this process. The address translation and memory management policies are affected by the virtual memory model used and by the organization of the disk arrays and of the main memory. 112. What do you mean by relocatability? How is it facilitated? Relocatability means that a program segment/page can be loaded into any portion of physical memory, or shifted in physical memory to a new location as the need arises (during compaction). This is facilitated by assigning virtual addresses to code and data during compilation. Virtual addresses can be mapped into the correct physical addresses through address translation at run time. 113. What is the advantage of using virtual memory space? What is the overhead incurred in turn? The use of virtual memory facilitates sharing of the main memory by many users on a dynamic basis. It also facilitates software portability and allows users to execute programs requiring much more memory than the physical memory. It allows program relocatability and makes it possible to implement protection in the OS kernel and optimization of memory management and allocation. 114. Write a note on address mapping. Let V be the set of virtual addresses generated by a program running on a processor. Let M be the set of physical addresses allocated to run this program. A virtual memory system demands an automatic mechanism to implement the following

mapping: ft : V → M ∪ {φ}. This mapping is a function of time, which varies from time to time because the physical memory is dynamically allocated and deallocated. Consider any virtual address v ∈ V. The mapping ft is formally defined as follows:

ft(v) = m, if m ∈ M has been allocated to store the data identified by virtual address v
ft(v) = φ, if the data for v is missing in M

In other words, the mapping ft(v) uniquely translates the virtual address v into a physical address m if there is a memory hit in M. When there is a memory miss, the

value returned, ft(v) = φ, signals that the referenced item has not yet been brought into main memory. 115. What are the implications of address translation? The efficiency of the address translation process affects the performance of the virtual memory. Processor support is needed for precise interrupts and translation-map updates. Virtual memory is more difficult to implement in a multiprocessor, where additional problems such as coherence, protection, and consistency become much more involved. 116. What are the two types of virtual memory models? Private virtual memory model and shared virtual memory model. 117. What is private virtual memory? What are its advantages and disadvantages? Each processor has a private virtual memory space. This is divided into pages. Virtual pages from different virtual spaces are mapped into the same physical memory shared by all processors. This model is followed in the VAX/11 and in most UNIX systems. The advantages of using this model include the use of a small processor address space (32 bits), protection on each page or on a per-process basis, and the use of private memory maps, which require no locking. The shortcoming lies in the synonym problem, in which different virtual addresses in different (or the same) virtual spaces point to the same physical page. Another problem is that the same virtual address in different virtual spaces may point to different pages in the main memory. 118. What is shared virtual memory? What are its advantages and disadvantages? This model combines all the virtual address spaces into a single globally shared virtual space. Each processor is given a portion of the shared virtual memory to declare its addresses. Different processors may use disjoint spaces. Some areas of virtual space can also be shared by multiple processors. Examples of machines using shared virtual memory include the IBM 801, RT, RP3, System 38, the HP Spectrum, the Stanford Dash, MIT Alewife, Tera, etc. 
The advantages in using shared virtual memory include the fact that all addresses are unique. However, each processor must be allowed to generate addresses larger than 32 bits, such as 46 bits for a 64 TB address space. Synonyms are not allowed in a globally shared virtual memory. The page table must allow shared accesses. Therefore, mutual exclusion is needed to enforce protected access. Segmentation is built on top of the paging system to confine each process to its own address space (segments). Global virtual memory makes the address translation process even longer. 119.What is the purpose of memory allocation? The purpose of memory allocation is to allocate pages of virtual memory to the page frames of the physical memory. 120.What is address translation mechanism? This is the process of translation of virtual addresses into physical addresses. Various schemes are available. The translation uses translation maps which can be used in various ways. Translation maps are stored in the cache, in associative memory or in the main memory. To access these maps, a mapping function is applied to the virtual address. The function generates a pointer to the desired translation map. This mapping can be implemented with a hashing or congruence function. 121.What is hashing? How is it used in address translation? Hashing is a simple computer technique for converting a long page number into a short one with fewer bits. The hashing function randomizes the virtual page number into a unique hash value that points into a mapping table entry, which points to the corresponding page frame. 122.Describe the need for translation look-aside buffer. During execution of multiple user programs, the main memory is loaded with pages that contain the currently required code and data for each program. Every time, the processor needs to access code or data from the current page, it addresses memory with a virtual address. This address has to be translated into the equivalent physical address. 
This translation is carried out by the address translation mechanism. It constitutes an additional overhead for every memory access and may affect the memory access time. Therefore, any improvement in the time taken for address translation is welcome. This can be achieved by storing the page frame numbers of frequently accessed pages against their virtual page numbers in a faster cache or associative memory, so that translation is quicker. This faster memory is called the TRANSLATION LOOK-ASIDE BUFFER (TLB). When the required virtual page number is not available in the TLB (a miss), the page tables in main memory are consulted. If the referenced page itself is not resident in main memory (for example, when a running program branches or calls a subroutine in a page that has not been loaded, or when a newly scheduled program needs its pages), the page must be brought from secondary storage into main memory, replacing a page chosen by some replacement policy if memory is full. The TLB is then updated with the new page entry. 123. Explain address translation using the Translation Look-aside Buffer.

Translation maps appear in the form of a translation look-aside buffer and page tables. Based on the principle of locality in memory references, a particular working set of pages is referenced within a given context or time window. The TLB is a high speed lookup table which stores the most recently or likely referenced page entries. A page entry consists of essentially a (virtual page number, page frame number) pair. It is hoped that pages belonging to the same working set will be directly translated using the TLB entries.

The use of a TLB and PTs for address translation is shown in Fig. (see next page). Each virtual address is divided into three fields. The leftmost field holds the virtual page number, the middle field identifies the cache block number, and the rightmost field is the word address within the block. Our purpose is to produce the physical address consisting of the page frame number, the block number, and the word address. The first step of the translation is to use the virtual page number as a key to search through the TLB for a match. The TLB can be implemented with a special associative memory or with part of the cache memory. In case of a match in the TLB, the page frame number is retrieved from the matched page entry. The cache block and word address are copied directly. In case a match cannot be found in the TLB, a hashed pointer is used to identify one of the page tables where the desired page frame number can be retrieved. 124. Explain Paged memory management.

Paging is a technique for partitioning both the physical memory and virtual memory into fixed size pages. Exchange of information between them is conducted at the page level as described before. Page tables are used to map between pages and page frames. These tables are implemented in the main memory upon creation of user processes in application programs. Since many user processes may be created dynamically, the number of PTs maintained in the main memory can be very large. The Page Table Entries are similar to the TLB entries, containing essentially (virtual page, page frame) address pairs. TLB and PTEs should be dynamically updated to reflect the latest memory reference history. Only snapshots of the history are maintained in these translation maps. If the demanded page cannot be found in the PT a page fault is declared. A page fault implies that the referenced page is not resident in the main memory. When a page fault occurs, the running process is suspended. A context switch is made to another ready-to-run process while the missing page is transferred from the disk or tape unit to the physical memory.
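The TLB-then-page-table lookup described above can be sketched in Python. The 4 KB page size matches the text; the TLB and page-table contents are invented for illustration, and a missing page-table entry stands in for a page fault.

```python
# Sketch: virtual-to-physical address translation with a TLB in front of
# a single-level page table.

PAGE_SIZE = 4096

tlb = {}                     # virtual page number -> page frame (fast, small)
page_table = {3: 7, 4: 1}    # resident pages only (illustrative contents)

def translate(vaddr):
    vpn, offset = divmod(vaddr, PAGE_SIZE)
    if vpn in tlb:                 # TLB hit: fast path
        frame = tlb[vpn]
    elif vpn in page_table:        # TLB miss: walk the page table in memory
        frame = page_table[vpn]
        tlb[vpn] = frame           # cache the translation for next time
    else:                          # page fault: page not resident
        raise LookupError("page fault on virtual page %d" % vpn)
    return frame * PAGE_SIZE + offset

print(translate(3 * PAGE_SIZE + 100))   # 28772 = 7 * 4096 + 100 (via page table)
print(translate(3 * PAGE_SIZE + 100))   # 28772 again, now via the TLB
```

On a real page fault the OS would suspend the process and schedule another while the page is fetched, as described above; the exception here merely marks that boundary.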

This direct page mapping can be extended with multiple levels of page tables. Multilevel paging takes a longer time to generate the desired physical address because multiple memory references are needed to access the sequence of page tables. The reason for multilevel paging is to expand the memory space and to provide more sophisticated protection of page access at different levels of the memory hierarchy.

125.Explain segmented memory management.

A large number of pages can be shared by segmenting the virtual address space among multiple user programs simultaneously. A segment of scattered pages is formed logically in the virtual memory space. Segments are defined by users in order to declare a portion of the virtual address space. In a segmented memory system, user programs can be logically structured as segments. Segments can invoke each other. Unlike pages, segments can have variable lengths. The management of a segmented memory system is much more complex due to the non-uniform segment size. Segments are a user- oriented concept, providing logical structures of programs and data in the virtual address space. On the other hand, paging facilitates the management of physical memory. In a paged system, all page addresses form a linear address space within the virtual space.

The segmented memory is arranged as a two-dimensional address space. Each virtual address in this space has a prefix field called the segment number and a postfix field called the offset within the segment. The offset addresses within each segment form one dimension of contiguous addresses. The segment numbers, not necessarily contiguous to each other, form the second dimension of the address space.

126.What are Paged Segments?

The concepts of paging and segmentation can be combined to implement a type of virtual memory with paged segments. Within each segment, the addresses are divided into fixed size pages. Each virtual address is thus divided into three fields. The upper field is the segment number, the middle one is the page number, and the lower one is the offset within each page. Paged segments offer the advantages of both paged memory and segmented memory. For users, program files can be better logically structured. For the OS, the virtual memory can be systematically managed with fixed size pages within each segment. Tradeoffs do exist among the sizes of the segment field, the page field, and the offset field. This sets limits on the number of segments that can be declared by users, the segment size (the number of pages within each segment), and the page size.
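The three-field address of a paged-segment system can be sketched with bit arithmetic in Python; the field widths (8-bit segment number, 12-bit page number, 12-bit offset) are assumptions chosen purely for illustration.

```python
# Sketch: splitting a paged-segment virtual address into its three fields.

SEG_BITS, PAGE_BITS, OFFSET_BITS = 8, 12, 12   # assumed widths

def split(vaddr):
    offset = vaddr & ((1 << OFFSET_BITS) - 1)
    page = (vaddr >> OFFSET_BITS) & ((1 << PAGE_BITS) - 1)
    segment = vaddr >> (OFFSET_BITS + PAGE_BITS)
    return segment, page, offset

# Build an address in segment 2, page 5, offset 9, then take it apart.
vaddr = (2 << (PAGE_BITS + OFFSET_BITS)) | (5 << OFFSET_BITS) | 9
print(split(vaddr))   # (2, 5, 9)
```

The tradeoff the text mentions is visible here: widening one field narrows another, limiting either the number of segments, the pages per segment, or the page size.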

127. What is the need for inverted Paging ?

The direct paging works well with a small virtual address space such as 32 bits. In modern computers, the virtual address space is very large, such as 52 bits in the IBM RS/6000. A large virtual address space demands either large PTs or multilevel direct paging which will slow down the address translation process and thus lower the performance. Besides direct mapping, address translation maps can also be implemented with inverted mapping.

128. Mention some processors that use inverted Paging.

The IBM 801 prototype and subsequently the IBM RT/PC have implemented inverted mapping for page address translation.

129.Describe Inverted Paging.

An inverted page table contains one entry for each page frame that has been allocated to users. Any virtual page number can be paired with a given physical page number.

Inverted page tables are accessed either by an associative search or by the use of a hashing function. An inverted PT includes entries only for the virtual pages that are currently resident in physical memory. This provides a significant reduction in the size of the page tables.

The generation of a long virtual address from a short effective address is done with the help of segment registers. The leading 4 bits of a 32 bit address name a segment register. The register provides a segment id that replaces the 4 bit sreg field to form a long virtual address. This effectively creates a single long virtual address space with segment boundaries at multiples of 256 MB. The IBM RT/PC has a 12 bit segment id and a 40 bit virtual address space. Either associative page tables or inverted page tables can be used to implement inverted mapping. The inverted page table can also be assisted with the use of a TLB. An inverted PT avoids the use of a large page table or a sequence of page tables. Given a virtual address to be translated, the hardware searches the inverted PT for that address and, if it is found, uses the table index of the matching entry as the address of the desired page frame. A hashing table is used to search through the inverted PT. The size of an inverted PT is governed by the size of the physical space. Because of the limited physical space, no multiple levels are needed for the inverted page table.
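The hashed search described above can be sketched as follows. The structure names and the use of Python's built-in `hash` are illustrative assumptions; machines such as the IBM RT/PC implement the hash-anchor search in hardware:

```python
# Minimal sketch of inverted-page-table lookup via hashing (illustrative;
# real hardware uses a hash-anchor table, not a Python dict).
NUM_FRAMES = 8  # one inverted-table entry per physical page frame

# frame index -> (process id, virtual page number), or None if free
inverted_table = [None] * NUM_FRAMES
# hash bucket -> list of candidate frame indices (collision chain)
hash_anchor = {}

def install(pid: int, vpn: int, frame: int) -> None:
    """Record that virtual page vpn of process pid occupies this frame."""
    inverted_table[frame] = (pid, vpn)
    hash_anchor.setdefault(hash((pid, vpn)) % NUM_FRAMES, []).append(frame)

def translate(pid: int, vpn: int):
    """Return the frame holding (pid, vpn), or None on a page fault."""
    for frame in hash_anchor.get(hash((pid, vpn)) % NUM_FRAMES, []):
        if inverted_table[frame] == (pid, vpn):
            return frame  # the table index of the match IS the frame number
    return None

install(pid=1, vpn=0x345, frame=5)
```

Note that the table index of the matching entry directly gives the page frame number, which is why no forward page table is needed.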

130.Explain the process of paging and segmentation in the Intel i486 processor.

The i486 supports both paging and segmentation.

Modes of operation: There are four levels of protection. The physical memory is 4 GB in size. The maximum memory size in real mode is 1 MB. Protected mode allows software written for the 8086, 80286, and 80386 to be run, and it increases the virtual address space from 4 GB to 64 TB.

Segmentation: A segment can vary in length from 1 byte to 4 GB; it can start at any base address, and overlapping segment addresses are allowed. When a 4 GB segment is selected, the entire physical memory becomes one large segment, which means segmentation is effectively disabled. The virtual address has a 16 bit segment selector (Fig a) to determine the base address of the linear address space to be used with the i486 paging system. The 32 bit offset specifies the internal address within a segment. The segment descriptor specifies access rights and segment size, besides selecting the address of the first byte of the segment.

Paging: Paging is optional; it can be enabled or disabled under software control. When enabled, the virtual address is first translated into a linear address and then into the physical address. When disabled, the linear address and the physical address are the same. The standard page size on the i486 is 4 KB. Four control registers are used to select between regular paging and page fault handling. The page table directory (4 KB) holds 1024 page directory entries (Fig c). Each page table at the second level is also 4 KB and holds up to 1024 PTEs. The upper 20 linear address bits are compared to determine whether there is a hit. The hit ratios of the TLB and of the page tables depend on program behavior and the efficiency of the page replacement policies. A 98% hit ratio has been observed in TLB operations in the past.
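The two-level translation above splits the 32-bit linear address into a 10-bit directory index, a 10-bit page-table index, and a 12-bit offset (1024 directory entries, 1024 PTEs per table, 4 KB pages). A minimal sketch of the split:

```python
# i486-style two-level split of a 32-bit linear address:
# 10-bit page directory index | 10-bit page table index | 12-bit offset.
def split_linear_address(la: int):
    offset = la & 0xFFF             # 12-bit offset within a 4 KB page
    table = (la >> 12) & 0x3FF      # 10-bit index into a page table (1024 PTEs)
    directory = (la >> 22) & 0x3FF  # 10-bit index into the page directory
    return directory, table, offset

d, t, o = split_linear_address(0x12345678)
# -> directory 0x48, table 0x345, offset 0x678
```

The upper 20 bits (directory plus table fields) are exactly the bits compared against a TLB entry to detect a hit.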
Pure physical addressing: A 32-entry TLB (Fig b) is used to convert the linear address directly into the physical address without resorting to the two-level paging scheme.

Memory organizations supported: pure paging, pure segmentation, segmented paging, and pure physical addressing without paging or segmentation.

131.Explain the concept of memory replacement policy.

Main memory is partitioned into page frames, which hold the currently active pages of the processes being executed concurrently by the processor. During execution, the processor may demand a data or code item that is not present in any currently resident page; this is termed a page fault. The page containing the required item is then searched for at the next memory level in the hierarchy, and higher levels if necessary, until it is found. That page is loaded into the successive lower levels until it is available to the processor, i.e., the required page is loaded into main memory. There must be enough free space in main memory to load the page; if not, an existing page in memory is deallocated and the new page replaces it in that page frame. Which page to deallocate is determined by a replacement policy. Memory management policies therefore include the allocation and deallocation of memory pages to active processes and the replacement of memory pages.

132.Define page replacement.

Page replacement refers to the process in which a resident page in main memory is replaced by a new page transferred from the disk.

133.Which is a good page replacement policy?

The goal of a page replacement policy is to minimize the number of page faults so that the effective memory access time is reduced. A good policy should match the program locality property.

134.What are the factors that affect the performance of a page replacement policy?
The effectiveness of a replacement algorithm depends on the program behavior and the memory traffic patterns encountered. The policy is also affected by the page size and by the number of available page frames.

135.What do you mean by a page trace? What is its use?

A page trace is a sequence of page frame numbers generated during the execution of a given program. It is used to analyze the performance of a paging memory system.

136.What is the use of a page trace?

A page trace is used to analyze the performance of a paging memory system by computing the hit ratio.

137.How do you detect a page hit or page fault?

Each page frame number (PFN) corresponds to the prefix portion of a physical memory address. The occurrence of page hits or page faults can be determined by tracing the successive PFNs in a page trace against the resident page numbers in the page frames. If the required page is already resident in a page frame, we have a page hit; otherwise we have a page fault.

138. What happens when a page fault occurs?

A page fault occurs when all the page frames are occupied by pages other than the one required. The required page has to be brought from the next memory level and loaded into some page frame. The replacement policy determines which page in the current level should be replaced by the newly loaded page.

139.Describe in detail various page replacement policies under demand paging.

Consider a page trace P(n) = r(1)r(2)…r(n) consisting of n PFNs requested in discrete time from 1 to n, where r(t) is the PFN requested at time t. The forward distance ft(x) for a page x is the number of time slots from time t to the first repeated reference of page x in the future. Similarly, the backward distance bt(x) is the number of time slots from time t to the most recent reference of page x in the past.

That is, ft(x) = k if r(t+k) = x is the first reappearance of x in P(n) after time t, and ft(x) = ∞ if x does not reappear in P(n); and

bt(x) = k if r(t−k) = x is the most recent reference of x in P(n) before time t, and bt(x) = ∞ if x never appeared in P(n) in the past.

Let R(t) be the resident set of all pages residing in main memory at time t. Let q(t) be the page to be replaced from R(t) when a page fault occurs at time t.

i. Least recently used (LRU): This policy replaces the page in R(t) which has the longest backward distance, i.e., the page that has been least recently used among all resident pages:

q(t) = y iff bt(y) = max { bt(x) : x ∈ R(t) }

ii. Optimal algorithm (OPT): This policy replaces the page in R(t) which has the longest forward distance, i.e., the page that will not be referenced for the longest time in the future:

q(t) = y iff ft(y) = max { ft(x) : x ∈ R(t) }

iii. First-in first-out (FIFO): This policy replaces the page in R(t) which has been in memory for the longest time.

iv. Least frequently used (LFU): This policy replaces the page in R(t) which has been least referenced in the past.

v. Circular FIFO: This policy joins all the page frame entries into a circular FIFO queue. A pointer points to the front of the queue. An allocation bit, associated with each page frame, is set upon initial allocation of a page to the frame. When a page fault occurs, the queue is scanned circularly from the pointer position. The pointer skips the allocated page frames, resetting their allocation bits, and replaces the first unallocated page frame. When all frames are allocated, the front of the queue is replaced, as in the FIFO policy.

vi. Random replacement: Any page is chosen for replacement at random.

140.Are there any policies for management of cache memory? Give details.

The relationship between cache block frames and cache blocks is similar to that between page frames and pages on a disk. Therefore, the page replacement policies above can be modified for block replacement when a cache miss occurs. Different cache organizations may offer different degrees of flexibility in implementing some of the replacement algorithms. The cache memory is often associatively searched, while the main memory is randomly addressed. Due to the difference between page allocation in main memory and block allocation in the cache, the replacement policies affect the cache hit ratio and the memory page hit ratio differently. Cache traces are often needed to evaluate cache performance.

141.Give one example to showcase the performance of various page replacement policies.

Assume we have two levels in our memory hierarchy: main memory M1 and disk memory M2. Let the page size be 4 words, the number of page frames in M1 be 3 (labeled a, b, c), and the number of pages in M2 be 10 (identified by 0, 1, …, 9). The ith page in M2 consists of word addresses 4i to 4i+3 for all i = 0 to 9. Assume the following sequence of word addresses, grouped when they belong to the same page. The sequence of page numbers so formed is the page trace:

Word trace: [0,1,2,3] [4,5,6,7] [8] [16,17] [9,10,11] [12] [28,29,30] [8,9,10] [4,5] [12] [4,5]
Page trace:  0 1 2 4 2 3 7 2 1 3 1

Time:        1  2  3  4  5  6  7  8  9 10 11
Page trace:  0  1  2  4  2  3  7  2  1  3  1

LRU                                              Hit ratio: 3/11
  PF a:      0  0  0  4  4  4  7  7  7  3  3
  PF b:         1  1  1  1  3  3  3  1  1  1
  PF c:            2  2  2  2  2  2  2  2  2
  Fault:     *  *  *  *     *  *     *  *

OPT                                              Hit ratio: 4/11
  PF a:      0  0  0  4  4  3  7  7  7  3  3
  PF b:         1  1  1  1  1  1  1  1  1  1
  PF c:            2  2  2  2  2  2  2  2  2
  Fault:     *  *  *  *     *  *        *

FIFO                                             Hit ratio: 2/11
  PF a:      0  0  0  4  4  4  4  2  2  2  2
  PF b:         1  1  1  1  3  3  3  1  1  1
  PF c:            2  2  2  2  7  7  7  3  3
  Fault:     *  *  *  *     *  *  *  *  *
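The page-tracing experiment can be reproduced with a short simulation. This is a minimal sketch; the `simulate` helper and its policy handling are illustrative, not from the text:

```python
# Simulate LRU, OPT and FIFO on a page trace with a fixed number of frames,
# returning the number of page hits.
def simulate(trace, frames, policy):
    resident, hits = [], 0
    for t, page in enumerate(trace):
        if page in resident:
            hits += 1
            if policy == "LRU":          # move to most-recently-used end
                resident.remove(page)
                resident.append(page)
            continue
        if len(resident) == frames:      # page fault with all frames full
            if policy in ("FIFO", "LRU"):
                victim = resident[0]     # oldest in / least recently used
            else:                        # OPT: longest forward distance
                future = trace[t + 1:]
                victim = max(resident,
                             key=lambda p: future.index(p) if p in future
                             else float("inf"))
            resident.remove(victim)
        resident.append(page)
    return hits

trace = [0, 1, 2, 4, 2, 3, 7, 2, 1, 3, 1]
# With 3 frames: LRU gives 3 hits, OPT 4 hits, FIFO 2 hits (hit ratios
# 3/11, 4/11 and 2/11), matching the table.
```

For LRU and FIFO the resident list doubles as the eviction queue; only LRU reorders it on a hit, which is the entire behavioral difference between the two.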

Page tracing experiments for three replacement policies, LRU, OPT and FIFO, are shown above. The successive pages loaded in the page frames form the trace entries. Initially all page frames (PFs) are empty. The results indicate the superiority of the OPT policy over the others; however, OPT cannot be implemented in practice. The LRU policy performs better than FIFO due to the locality of references. From these results, we realize that LRU is generally better than FIFO; however, the results also depend on program behavior.

142. Analyze the performance of various page replacement policies.

The performance of a page replacement algorithm depends on the page trace (program behavior) encountered. The best policy is the OPT replacement algorithm; however, OPT is not realizable because no one can predict the future page demands of a program. The LRU algorithm is a popular policy and often results in a high hit ratio. The FIFO and random policies may perform badly because they can violate program locality. The circular FIFO policy attempts to approximate LRU with a simple circular-queue implementation. The LFU policy may perform between the LRU and FIFO policies. However, there is no fixed superiority of any policy over the others, because performance depends on program behavior and the run-time status of the page frames. In general, the page fault rate is a monotonic decreasing function of the size of the resident set R(t) at time t, because more resident pages result in a higher hit ratio in main memory.
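The closing claim, that the fault rate decreases as the resident set grows, holds for LRU (a stack algorithm), though not for FIFO in general. A minimal sketch (the `lru_hits` helper is an assumption, not from the text) checks it on the example trace:

```python
# Count LRU page hits for a given number of frames; the resident list is
# kept in recency order, with the least recently used page at the front.
def lru_hits(trace, frames):
    resident, hits = [], 0
    for page in trace:
        if page in resident:
            hits += 1
            resident.remove(page)        # re-append below as most recent
        elif len(resident) == frames:
            resident.pop(0)              # evict the least recently used
        resident.append(page)
    return hits

trace = [0, 1, 2, 4, 2, 3, 7, 2, 1, 3, 1]
counts = [lru_hits(trace, f) for f in range(1, 6)]
# counts never decreases as the number of frames grows
```

Running the same experiment with FIFO can exhibit Belady's anomaly, where adding frames occasionally increases faults, which is why the claim is stated only "in general".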