IJMEIT // Vol. 04, Issue 08 // August 2016 // Page No: 1729-1735 // ISSN 2348-196x

Investigation into Gang Scheduling by Integration of Multithreading in Multi-core Processors

Authors
Rupali¹, Shailja Kumari²
¹Research Scholar, Department of CSA, CDLU Sirsa
²Assistant Professor, Department of CSA, CDLU Sirsa
Email: [email protected]

ABSTRACT
The objective of this research is to increase the efficiency of scheduling dependent tasks using enhanced multithreading. Gang scheduling of parallel implicit-deadline periodic task systems upon identical multiprocessor platforms is considered. In this scheduling problem, parallel tasks use several processors simultaneously. The first algorithm is based on linear programming & is the first one to be proved optimal for the considered gang scheduling problem. Furthermore, it runs in polynomial time for a fixed number m of processors, & an efficient implementation is fully detailed. The second algorithm is an approximation algorithm based on a fixed-priority rule that is competitive under resource augmentation analysis in order to compute an optimal schedule pattern. Precisely, its factor is bounded by (2 − 1/m); for instance, for m = 4 processors the bound is 1.75. In computer architecture, multithreading is the ability of a central processing unit (CPU) or a single core within a multi-core processor to execute multiple processes or threads concurrently, appropriately supported by the operating system. This approach differs from multiprocessing, as with multithreading the processes & threads share the resources of a single core or of multiple cores: the computing units, the CPU caches, & the translation lookaside buffer (TLB). Where multiprocessing systems include multiple complete processing units, multithreading aims to increase the utilization of a single core by using thread-level as well as instruction-level parallelism.
Keywords: TLP, Response Time, Latency, Throughput, Multithreading, Scheduling

1. INTRODUCTION
The multithreading paradigm has become more popular as efforts to further exploit instruction-level parallelism have stalled since the late 1990s. This allowed the concept of throughput computing to re-emerge from the more specialized field of transaction processing; even though it is very difficult to further speed up a single thread or a single program, most computer systems are actually multitasking among multiple threads or programs. Thus, techniques that improve the throughput of all tasks result in overall performance gains.
The simplest type of multithreading occurs when one thread runs until it is blocked by an event that normally would create a long-latency stall. Such a stall might be a cache miss that has to access off-chip memory, which might take hundreds of CPU cycles for the data to return. Instead of waiting for the stall to resolve, a threaded processor would switch execution to another thread that was ready to run. Only when the data for the previous thread had arrived would the previous thread be placed back on the list of ready-to-run threads.
Conceptually, this is similar to the cooperative multitasking used in real-time operating systems, in which tasks voluntarily give up execution time when they need to wait upon some type of event. This type of multithreading is known as block, cooperative or coarse-grained multithreading.
The goal of multithreading hardware support is to allow quick switching between a blocked thread & another thread ready to run. To achieve this goal, the hardware cost is to replicate the program-visible registers as well as some processor control registers. Switching from one thread to another means the hardware switches from using one register set to another; to switch efficiently between active threads, each active thread needs to have its own register set.
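As a rough illustration of this switch-on-stall idea (a minimal sketch with made-up cycle counts, not a measurement from this paper), the following MATLAB fragment models two threads on one core, where each cache miss stalls a thread for a fixed number of cycles:

% Minimal sketch (illustrative only): one core, two threads.
% Each thread needs WORK busy cycles; a cache miss every MISS_EVERY
% cycles stalls it for STALL cycles. Switching on a stall lets the
% other thread run while the memory access is outstanding.
WORK = 1000; MISS_EVERY = 50; STALL = 200;   % hypothetical numbers
misses = floor(WORK / MISS_EVERY);
% Single thread: every stall is fully exposed.
t_single = 2*WORK + 2*misses*STALL;
% Coarse-grained multithreading: stalls overlap with the other
% thread's useful work (simple upper-bound model).
t_cgmt = max(2*WORK, WORK + misses*STALL);
fprintf('single thread: %d cycles, switch-on-stall: %d cycles\n', ...
        t_single, t_cgmt);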


For example, to quickly switch between two threads, the register hardware needs to be instantiated twice. Additional hardware support for multithreading allows thread switching to be done in one CPU cycle, bringing performance improvements. Also, additional hardware allows each thread to behave as if it were executing alone & not sharing any hardware resources with the other threads, minimizing the amount of software changes needed within the application & the operating system to support multithreading.
Many families of microcontrollers & embedded processors have multiple register banks to allow quick context switching for interrupts. Such schemes could be considered a type of block multithreading among the user program thread & the interrupt threads.

Fig 1: Multithreaded execution

Interleaved multithreading
The objective of interleaved multithreading is to remove all data dependency stalls from the execution pipeline. Since one thread is relatively independent from the other threads, there is less chance of one instruction in one pipelining stage needing an output from an older instruction in the pipeline. Conceptually, it is similar to the preemptive multitasking used in operating systems; an analogy would be that the time slice given to each active thread is one CPU cycle. For example:
1. Cycle i: an instruction from thread A is issued.
2. Cycle i + 1: an instruction from thread B is issued.
3. Cycle i + 2: an instruction from thread C is issued.
This type of multithreading was first called barrel processing, in which the staves of a barrel represent the pipeline stages & their executing threads. Interleaved, preemptive, fine-grained or time-sliced multithreading are more modern terminology.
In addition to the hardware costs discussed for the block type of multithreading, interleaved multithreading has an additional cost of each pipeline stage tracking the thread ID of the instruction it is processing. Also, since there are more threads being executed concurrently in the pipeline, shared resources such as caches & TLBs need to be larger to avoid thrashing between the different threads.

Simultaneous multithreading
The most advanced type of multithreading applies to superscalar processors. Whereas a normal superscalar processor issues multiple instructions from a single thread every CPU cycle, in simultaneous multithreading (SMT) a superscalar processor could issue instructions from multiple threads every CPU cycle. Recognizing that any single thread has a limited amount of instruction-level parallelism, this type of multithreading tries to exploit the parallelism available across multiple threads to decrease the waste associated with unused issue slots. For example:
1. Cycle i: instructions j & j + 1 from thread A & instruction k from thread B are simultaneously issued.
2. Cycle i + 1: instruction j + 2 from thread A, instruction k + 1 from thread B, & instruction m from thread C are all simultaneously issued.
3. Cycle i + 2: instruction j + 3 from thread A & instructions m + 1 & m + 2 from thread C are all simultaneously issued.
To distinguish the other types of multithreading from SMT, the term "temporal multithreading" is used to denote when instructions from only one thread could be issued at a time.
In addition to the hardware costs discussed for interleaved multithreading, SMT has the additional cost of each pipeline stage tracking the thread ID of each instruction being processed. Again, shared resources such as caches & TLBs have to be sized for the large number of active threads being processed.
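The issue-slot accounting behind these examples can be visualized with a small sketch. The following MATLAB fragment is purely illustrative; the instruction mix is assumed & only mirrors the three cycles listed above:

% Issue-slot usage on a 3-wide superscalar core over 3 cycles.
% Rows are cycles; each entry names the thread issuing in that slot
% ('-' marks a wasted slot). Mirrors the cycle i .. i+2 SMT example.
smt    = ['A' 'A' 'B';    % cycle i:   j, j+1 from A; k from B
          'A' 'B' 'C';    % cycle i+1: j+2 from A; k+1 from B; m from C
          'A' 'C' 'C'];   % cycle i+2: j+3 from A; m+1, m+2 from C
single = ['A' 'A' '-';    % the same core running only thread A leaves
          'A' '-' '-';    % slots empty whenever A lacks
          'A' '-' '-'];   % instruction-level parallelism
util = @(s) nnz(s ~= '-') / numel(s);
fprintf('SMT slot utilization: %.0f%%, single thread: %.0f%%\n', ...
        100*util(smt), 100*util(single));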


Implementations include DEC (later Compaq) EV8 (not completed), Intel Hyper-Threading, IBM POWER5, Sun Microsystems UltraSPARC T2, MIPS MT, & the Cray XMT.

2. LITERATURE REVIEW
Yeh-Ching Chung wrote on "Applications & Performance Analysis of a Compile-Time Optimization Approach for List Scheduling Algorithms on Distributed-Memory Multiprocessors". They proposed a compile-time optimization approach, the bottom-up top-down duplication heuristic (BTDH), for static scheduling of directed acyclic graphs (DAGs) on distributed-memory multiprocessors (DMMs). In this paper, they discuss applications of BTDH for list scheduling algorithms (LSAs). There are two ways to use BTDH for LSAs: BTDH can be used with a LSA to form a new scheduling algorithm (LSA/BTDH), or it can be used as a pure optimization algorithm for a LSA (LSA-BTDH).
Ishfaq Ahmad & Yu-Kwong Kwok wrote on "On Parallelizing the Multiprocessor Scheduling Problem". Existing heuristics for scheduling a node- & edge-weighted directed task graph to multiple processors could produce satisfactory solutions but incur high time complexities that tend to exacerbate in more realistic environments with relaxed assumptions. Consequently, these heuristics do not scale well & cannot handle problems of moderate sizes. The algorithm also exhibits an interesting trade-off between solution quality & speedup while scaling well with problem size.
Maruf Ahmed & Sharif M. H. Chowdhury wrote on "List Heuristic Scheduling Algorithms for Distributed Memory Systems with Improved Time Complexity". They present a compile-time list heuristic scheduling algorithm called the Low Cost Critical Path (LCCP) algorithm for distributed memory systems. LCCP has low scheduling cost for both homogeneous & heterogeneous systems. In some recent papers, list heuristic scheduling algorithms keep their scheduling cost low by using a fixed-size heap & a FIFO, where the heap always keeps a fixed number of tasks & the excess tasks are inserted into the FIFO. When the heap has empty spaces, tasks are inserted into it from the FIFO. The best known list scheduling algorithm based on this strategy requires two heap restoration operations, one after extraction & another after insertion. Their LCCP algorithm improves on this by using only one such operation for both extraction & insertion, which in theory reduces the scheduling cost without compromising scheduling performance. In their experiments they compare LCCP with other well-known list scheduling algorithms, & the results show that LCCP is the fastest among all.
Wayne F. Boyer wrote on "Non-evolutionary Algorithm for Scheduling Dependent Tasks in Distributed Heterogeneous Computing Environments". The problem of obtaining an optimal matching & scheduling of interdependent tasks in distributed heterogeneous computing (DHC) environments is well known to be NP-hard. In a DHC system, task execution time depends on the machine to which the task is assigned, & task precedence constraints are represented by a directed acyclic graph. Recent research in evolutionary techniques has shown that genetic algorithms usually obtain more efficient schedules than other known algorithms. They propose a non-evolutionary random scheduling (RS) algorithm for efficient matching & scheduling of interdependent tasks in a DHC system. RS is a succession of randomized task orderings & a heuristic mapping from task order to schedule. A randomized task ordering is effectively a topological sort whose outcome may be any possible task order for which the task precedence constraints are maintained. A detailed comparison to existing evolutionary techniques (GA & PSGA) shows that the proposed algorithm is less complex than the evolutionary techniques, computes schedules in less time, requires less memory & needs fewer tuning parameters. Simulation results show that the average schedules produced by RS are approximately as efficient as PSGA schedules for all cases studied & clearly more efficient than PSGA for certain cases.
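The randomized task ordering at the heart of RS is, as noted above, a topological sort with random tie-breaking. A minimal MATLAB sketch of that idea follows; the 4-task DAG is a made-up example, not one from the cited paper:

% Randomized topological ordering of a task DAG (illustrative).
% A(i,j)=1 means task i must finish before task j starts.
A = [0 1 1 0; 0 0 0 1; 0 0 0 1; 0 0 0 0];   % hypothetical 4-task DAG
n = size(A,1);
indeg = sum(A,1);             % number of unfinished predecessors
order = zeros(1,n);
for k = 1:n
    ready = find(indeg == 0);            % tasks with no pending preds
    pick = ready(randi(numel(ready)));   % random tie-break
    order(k) = pick;
    indeg(pick) = inf;                   % never pick it again
    indeg = indeg - A(pick,:);           % release its successors
end
disp(order)   % every run yields a valid precedence-respecting order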


3. RESEARCH METHODOLOGY
In computing, scheduling is the method by which work specified by some means is assigned to the resources that complete the work. The work may be virtual computation elements such as threads, processes or data flows, which are in turn scheduled onto hardware resources such as processors, network links or expansion cards.
A scheduler is what carries out the scheduling activity. Schedulers are often implemented so that they keep all computer resources busy (as in load balancing), allow multiple users to share system resources effectively, or achieve a target quality of service. Scheduling is fundamental to computation itself, & an intrinsic part of the execution model of a computer system; the concept of scheduling makes it possible to have computer multitasking with a single central processing unit (CPU).
A scheduler may aim at one of several goals: for example, maximizing throughput (the total amount of work completed per time unit); minimizing response time (the time from work becoming enabled until the first point it begins execution on resources); minimizing latency (the time between work becoming enabled & its subsequent completion); or maximizing fairness (equal CPU time to each process, or more generally appropriate times according to the priority & workload of each process). In practice, these goals often conflict (e.g. throughput versus latency), thus a scheduler implements a suitable compromise. Preference is given to any one of the concerns mentioned above, depending upon the user's needs & objectives.
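These competing metrics can be made concrete with a small MATLAB sketch; the arrival, start & finish times below are hypothetical numbers chosen only to exercise the definitions:

% Scheduling metrics for three jobs on one CPU (illustrative numbers).
arrival = [0 2 4];    % time each job becomes enabled
start   = [0 5 8];    % time it first begins execution
finish  = [5 8 12];   % completion time
response = start  - arrival;               % wait until first execution
latency  = finish - arrival;               % enabled-to-completion time
throughput = numel(finish) / max(finish);  % jobs completed per unit time
fprintf('mean response %.1f, mean latency %.1f, throughput %.2f\n', ...
        mean(response), mean(latency), throughput);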
4. CHALLENGES WITHIN RESEARCH
Multiple threads could interfere with each other when sharing hardware resources such as caches or translation lookaside buffers (TLBs). As a result, the execution times of a single thread are not improved but could be degraded, even when only one thread is executing, due to lower frequencies or the additional pipeline stages that are necessary to accommodate the thread-switching hardware.
Overall efficiency varies; Intel claims up to 30% improvement with its Hyper-Threading technology,[1] while a synthetic program just performing a loop of non-optimized, dependent floating-point operations actually gains a 100% speed improvement when run in parallel. On the other hand, hand-tuned assembly language programs using MMX or AltiVec extensions & performing data pre-fetches (as a good video encoder might) do not suffer from cache misses or idle computing resources. Such programs therefore do not benefit from hardware multithreading & could indeed see degraded performance due to contention for shared resources.
From the software standpoint, hardware support for multithreading is more visible to software, requiring more changes to both application programs & operating systems than multiprocessing. The hardware techniques used to support multithreading often parallel the software techniques used for computer multitasking. Thread scheduling is also a major problem within multithreading.

5. PARALLEL COMPUTING
Parallel computing is a type of computation in which many calculations are carried out simultaneously, operating on the principle that large problems could often be divided into smaller ones, which are then solved at the same time. There are several different forms of parallel computing: bit-level, instruction-level, data & task parallelism. Parallelism has been employed for many years, mainly in high-performance computing, but interest in it has grown lately due to the physical constraints preventing frequency scaling.


As power consumption (& consequently heat generation) by computers has become a concern in recent years, parallel computing has become the dominant paradigm in computer architecture, mainly in the form of multi-core processors. Parallel computing is closely related to concurrent computing; they are frequently used together, & often conflated, though the two are distinct: it is possible to have parallelism without concurrency & concurrency without parallelism.

Fig 2: Sched

In parallel computing, a computational task is typically broken down into several, often many, very similar subtasks that could be processed independently & whose results are combined afterwards, upon completion.

In contrast, in concurrent computing, the various processes often do not address related tasks; when they do, as is typical in distributed computing, the separate tasks may have a varied nature & often require some inter-process communication during execution.
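In MATLAB, this decomposition into independent subtasks maps naturally onto a parfor loop. The sketch below is generic (the subtask body is a stand-in); with the Parallel Computing Toolbox the iterations run on pool workers, & without it the loop simply runs serially:

% Data-parallel sketch: independent subtasks, results combined at the end.
N = 8;
partial = zeros(1, N);
parfor i = 1:N                        % iterations may run concurrently
    partial(i) = sum((1:1e6).^2) * i; % stand-in for a real subtask
end
total = sum(partial);                 % combine results upon completion
disp(total)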

6. RESULT AND DISCUSSION
Single processor output:

disp('problem with 7 tasks and 7 processors');
u=[6 8 3 2 1 7 9];                  % task execution requirements
p=[2 1 2 1 1 6 5];                  % processors needed by each task
m=7;                                % processors in the platform
disp(['platform with ',num2str(m),' processors']);
disp('task execution requirements:');
[Slice X]=ghp(u,p,m);
disp('schedule slices(rows:tasks 1...n;column:schedule slices):');
disp(X);
disp('Slice is');
disp(Slice);
disp(['schedule length:',num2str(sum(X))]);   % total of the slice lengths

Gang heuristic function:

function [Slice X]= ghp(u,p,m)
disp('requirement');
disp(u);
disp('availability');
disp(p);
disp('no of processors');
disp(m);
c=u;
n=length(u);
disp('length is');
disp(n);
r=c;                      % remaining processing time of each task
disp('remaining processing times');
disp(r);
Slice=[];
X=[];
SchedCompleted=false;
tc=0;
eltm=0;
walltime=zeros(1,n);      % per-task wall-clock timing
k=1;
maxslice=max(r);
%matlabpool open          % worker pool disabled in the single-processor run
while(~SchedCompleted)
    used=0;
    sched=zeros(1,n);
    X(k)=maxslice;
    for i=1:n
        tic
        %for j=1:4        % multithreading disabled in the single-processor run
        %tc=tc+maxNumCompThreads(2^(j-1));
        if(r(i)>0 && used+p(i)<=m)    % greedily pack task i into slice k
            sched(i)=1;
            used=used+p(i);
            X(k)=min(X(k),r(i));      % slice ends when a packed task finishes
        end
        walltime(i)=toc;
        eltm=eltm+walltime(i);
        %end
    end
    Slice(:,end+1)=sched;             % record which tasks ran in slice k
    r=r-(X(k).*sched);                % advance the packed tasks
    if(r==0)                          % all remaining times are zero
        SchedCompleted=true;
    else
        k=k+1;
    end
end
%matlabpool close
disp('cpu cycle');
disp(tic);
disp('time');
disp(eltm);
end
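To see what the heuristic computes on these inputs: tasks are scanned in index order, tasks 1-5 fit in the first slice because their processor demands sum to 2+1+2+1+1 = 7 = m, & the slice length becomes the smallest remaining time among the packed tasks, min(6, 8, 3, 2, 1) = 1. The following stand-alone fragment (a checking aid, not part of the experiment code) reproduces that first packing step:

% First-slice packing check for u=[6 8 3 2 1 7 9], p=[2 1 2 1 1 6 5], m=7.
u = [6 8 3 2 1 7 9]; p = [2 1 2 1 1 6 5]; m = 7;
used = 0; packed = false(1, numel(u));
for i = 1:numel(u)
    if used + p(i) <= m        % same greedy test as in ghp
        packed(i) = true;
        used = used + p(i);
    end
end
slice_len = min(u(packed));    % slice runs until the first task finishes
fprintf('packed tasks: %s, slice length: %d\n', ...
        mat2str(find(packed)), slice_len);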


Code for the multithreaded-based processor:

disp('problem with 7 tasks and 7 processors');
u=[6 8 3 2 1 7 9];
p=[2 1 2 1 1 6 5];
m=7;
disp(['platform with ',num2str(m),' processors']);
disp('task execution requirements:');
[Slice X]=ghp(u,p,m);
disp('schedule slices(rows:tasks 1...n;column:schedule slices):');
disp(X);
disp('Slice is');
disp(Slice);
disp(['schedule length:',num2str(sum(X))]);

Modified gang heuristic function:

function [Slice X]= ghp(u,p,m)
disp('requirement');
disp(u);
disp('availability');
disp(p);
disp('no of processors');
disp(m);
c=u;
n=length(u);
disp('length is');
disp(n);
r=c;
disp('remaining processing times');
disp(r);
Slice=[];
X=[];
SchedCompleted=false;
tc=0;
eltm=0;
walltime=zeros(1,n);
k=1;
maxslice=max(r);
matlabpool open                        % open the worker pool (multithreaded run)
while(~SchedCompleted)
    used=0;
    sched=zeros(1,n);
    X(k)=maxslice;
    for i=1:n
        tic
        for j=1:4                      % vary the thread count: 1, 2, 4, 8
            tc=tc+maxNumCompThreads(2^(j-1));
            if(r(i)>0 && used+p(i)<=m)
                sched(i)=1;
                used=used+p(i);
                X(k)=min(X(k),r(i));
            end
            walltime(i)=toc;
            eltm=eltm+walltime(i);
        end
    end
    Slice(:,end+1)=sched;
    r=r-(X(k).*sched);
    if(r==0)
        SchedCompleted=true;
    else
        k=k+1;
    end
end
matlabpool close
disp('cpu cycle');
disp(tic);
disp('time');
disp(eltm);
end
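One portability note on the listing above: matlabpool was removed from MATLAB in R2015a, & on current releases a worker pool is managed with parpool instead. A sketch of the equivalent pool handling (same logic, newer API):

% Equivalent pool management on MATLAB R2015a and later.
pool = gcp('nocreate');        % reuse an existing pool if one is open
if isempty(pool)
    pool = parpool;            % open a pool of the default size
end
% ... run the scheduling experiment here ...
delete(pool);                  % close the pool (replaces matlabpool close)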


Step 1
Fig 1: Task execution requirement
Step 2
Fig 2: Finding schedule length

7. SCOPE OF RESEARCH
If several threads work on the same set of data, they could actually share their cache, leading to better cache usage or synchronization on its values. If a thread gets a lot of cache misses, the other threads could continue taking advantage of the unused computing resources, which may lead to faster overall execution, as these resources would have been idle if only a single thread were executed. Also, if a thread cannot use all the computing resources of the CPU (because instructions depend on each other's results), running another thread may prevent those resources from becoming idle.

REFERENCES
1. Remzi H. Arpaci-Dusseau & Andrea C. Arpaci-Dusseau (January 4, 2015). "Chapter 7: Scheduling: Introduction, Section 7.6: A New Metric: Response Time". Operating Systems: Three Easy Pieces (PDF). p. 6. Retrieved February 2, 2015.
2. Paul Krzyzanowski (2014-02-19). "Process Scheduling: Who gets to run next?". cs.rutgers.edu. Retrieved 2015-01-11.
3. Abraham Silberschatz, Peter Baer Galvin & Greg Gagne (2013). Operating System Concepts, 9th ed. John Wiley & Sons, Inc. ISBN 978-1-118-06333-0.
4. C code for FCFS scheduling.
5. Early Windows at the Wayback Machine.
6. Sriram Krishnan. "A Tale of Two Schedulers: Windows NT & Windows CE".
7. "Inside the Windows Vista Kernel: Part 1". TechNet.
8. "Vista Kernel Improvements".
9. "Technical Note TN2028: Threading Architectures".
10. "Mach Scheduling & Thread Interfaces".
11. http://www.ibm.com/developerworks/aix/library/au-aix5_cpu/index.html#N100F6
12. Molnár, Ingo (2007-04-13). "[patch] Modular Scheduler Core & Completely Fair Scheduler [CFS]". linux-kernel (Mailing list).
