
Low-Cost Epoch-Based Correlation Prefetching for Commercial Applications
Yuan Chou
Architecture Technology Group
Microelectronics Division

Motivation
• Performance of many commercial applications limited by processor stalls due to off-chip cache misses
• Applications characterized by irregular control flow and complex data access patterns
• Software prefetching and simple stride-based hardware prefetching ineffective
• Hardware correlation prefetching more promising - can remember complex recurring data access patterns
• Current correlation prefetchers have severe drawbacks, but we think we can overcome them

Talk Outline
• Traditional Correlation Prefetching
• Epoch-Based Correlation Prefetching
• Experimental Results
• Summary

Traditional Correlation Prefetching
• Basic idea: use current miss address M to predict N future miss addresses F1 ... FN (where N = prefetch depth)
• Miss address sequence: A B C D E F G H I (assume N=2)
  - Use A to prefetch B C
  - Use D to prefetch E F
  - Use G to prefetch H I
• Correlations recorded in correlation table
• Correlation table size proportional to application working set

Correlation Prefetching Drawbacks
• Very large correlation tables needed for commercial apps - impractical to store on-chip
• No attempt to eliminate all naturally overlapped misses
• Prefetches misses naturally overlapped with current miss
• Miss address sequence: A B C D E F G H I

[Figure: execution timeline alternating between compute and off-chip access periods; misses A B overlap, then C D E, then F G, then H I]

• Since C, D and E naturally overlapped, prefetching only C may not improve performance
• Since A and B naturally overlapped, prefetching B does not improve performance but wastes table storage
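The basic idea of traditional correlation prefetching can be illustrated with a minimal Python sketch. The class and method names below are hypothetical and not taken from the talk, and the sketch assumes the simplest possible policy: each miss address remembers the first N distinct misses that followed it, with no replacement. It reproduces the N=2 example from the slide.

```python
from collections import deque


class CorrelationTable:
    """Toy traditional correlation prefetcher (hypothetical names)."""

    def __init__(self, depth):
        self.depth = depth                  # N = prefetch depth
        self.table = {}                     # miss address -> up to N successors
        self.recent = deque(maxlen=depth)   # last N miss addresses seen

    def observe_miss(self, addr):
        # addr follows each recent miss, so record it as their successor.
        for prev in self.recent:
            successors = self.table.setdefault(prev, [])
            if addr not in successors and len(successors) < self.depth:
                successors.append(addr)
        self.recent.append(addr)

    def predict(self, addr):
        # Addresses to prefetch when addr misses (empty if addr is unseen).
        return list(self.table.get(addr, []))


trained = CorrelationTable(depth=2)
for miss in "ABCDEFGHI":
    trained.observe_miss(miss)

print(trained.predict("A"))  # ['B', 'C']
print(trained.predict("D"))  # ['E', 'F']
print(trained.predict("G"))  # ['H', 'I']
```

The table holds one entry per distinct miss address, which is why its size tracks the application working set.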

Epoch-Based Correlation Prefetching (EBCP)

Epoch MLP Model
• At high off-chip latencies, overlappable off-chip accesses appear to issue and complete together
• Program execution separates into recurring periods of on-chip computation followed by off-chip accesses
• Call each period an epoch
• Group off-chip accesses based on the epoch in which they issue

[Figure: execution timeline of compute and off-chip access periods, divided into Epoch i, Epoch i+1, Epoch i+2 and Epoch i+3]

Epoch:          i | i+1 | i+2 | i+3
Miss addresses: A B | C D E | F G | H I

Epoch Model Insights
• Insight #1: Target removal of entire epochs instead of individual misses
• Miss address sequence: A B C D E F G H I

[Figure: execution timeline divided into Epoch i, Epoch i+1, Epoch i+2 and Epoch i+3]

Epoch:          i | i+1 | i+2 | i+3
Miss addresses: A B | C D E (removed) | F G (removed) | H I

• Use first miss in epoch to prefetch all misses in next 2 epochs
• Results in removal of 2 epochs

Epoch-Based Correlation Prefetcher
• No prefetching
  Epoch:          i | i+1 | i+2 | i+3
  Miss addresses: A B | C D E | F G | H I
• Epoch-based correlation prefetching (EBCP)
  Epoch:          i | i+1
  Miss addresses: A B | H I
  Prefetches:     C D E F G
• Traditional correlation prefetching (depth=2)
  Epoch:          i | i+1 | i+2
  Miss addresses: A B | E | H I
  Prefetches:     B C D F G
• EBCP achieves better epoch reduction

Epoch Model Insights
• Insight #2: Hide latency of correlation table access under previous epoch

[Figure: execution timeline divided into Epoch i, Epoch i+1, Epoch i+2 and Epoch i+3]

Epoch:          i | i+1 | i+2 | i+3
Miss addresses: A B | C D E | F G (removed) | H I (removed)
Epoch i:        read correlation table
Epoch i+1:      issue prefetches F G H I

• Use miss in epoch i to prefetch all misses in epochs i+2 and i+3
• Use epoch i to read correlation table
• Use epoch i+1 to issue prefetches
• Results in removal of 2 epochs
• Correlation table can be stored in main memory!
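As a rough illustration of Insight #2, the Python sketch below groups misses into epochs and learns a table that maps the first miss of epoch i to the misses of epochs i+2 and i+3. Everything here is an assumption made for illustration, not the talk's implementation: the function names, the ("miss"/"fill", address) event format, and the rule that a new epoch begins whenever the number of outstanding off-chip misses goes from 0 to 1.

```python
def split_into_epochs(events):
    """Group miss addresses by epoch (assumed 0 -> 1 outstanding-miss rule)."""
    epochs = []
    outstanding = 0
    for kind, addr in events:
        if kind == "miss":
            if outstanding == 0:      # 0 -> 1 transition starts a new epoch
                epochs.append([])
            epochs[-1].append(addr)
            outstanding += 1
        else:                         # "fill": data returned from memory
            outstanding -= 1
    return epochs


def learn_table(epochs, skip=2, span=2):
    """Map the first miss of epoch i to all misses of epochs i+skip .. i+skip+span-1."""
    table = {}
    for i, epoch in enumerate(epochs):
        targets = [a for later in epochs[i + skip:i + skip + span] for a in later]
        if targets:
            table[epoch[0]] = targets
    return table


def trace(groups):
    """Hypothetical event trace: all misses of a group issue, then all return."""
    events = []
    for group in groups:
        events += [("miss", a) for a in group]
        events += [("fill", a) for a in group]
    return events


epochs = split_into_epochs(trace(["AB", "CDE", "FG", "HI"]))
table = learn_table(epochs)
print(epochs)      # [['A', 'B'], ['C', 'D', 'E'], ['F', 'G'], ['H', 'I']]
print(table["A"])  # ['F', 'G', 'H', 'I']
```

Skipping one epoch between the trigger and its targets is what leaves a full epoch of slack for the table read, so the lookup does not need to be fast.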

EBCP Advantages
• Trad: store correlation table on-chip
  EBCP: store correlation table in main memory (hide table access latency under previous epoch)
• Trad: no attempt to eliminate all naturally overlapped misses
  EBCP: target removal of entire epochs
• Trad: prefetch misses naturally overlapped with current miss
  EBCP: avoid prefetching these misses
• EBCP overcomes drawbacks of traditional correlation prefetchers
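The epoch-reduction comparison can be checked with a small replay in Python. This is a sketch under simplifying assumptions that are mine, not the talk's: misses in the same epoch issue together, a prefetch triggered during an epoch only helps later epochs and always arrives in time, and prefetch hits do not trigger further prefetches. The `replay` helper and both tables are hypothetical. The EBCP table uses the Insight #1 form (first miss in an epoch prefetches all misses in the next 2 epochs).

```python
EPOCHS = [["A", "B"], ["C", "D", "E"], ["F", "G"], ["H", "I"]]
SEQUENCE = [a for epoch in EPOCHS for a in epoch]


def replay(epochs, table):
    """Return the epochs that still contain at least one miss."""
    prefetched = set()
    remaining = []
    for epoch in epochs:
        # Only addresses prefetched during earlier epochs are covered.
        misses = [a for a in epoch if a not in prefetched]
        if misses:
            remaining.append(misses)
        for miss in misses:
            prefetched.update(table.get(miss, []))
    return remaining


# Traditional, depth 2: every miss address predicts the next two in the sequence.
traditional = {a: SEQUENCE[k + 1:k + 3] for k, a in enumerate(SEQUENCE)}

# EBCP: the first miss of an epoch predicts all misses of the next two epochs.
ebcp = {
    epoch[0]: [a for nxt in EPOCHS[i + 1:i + 3] for a in nxt]
    for i, epoch in enumerate(EPOCHS)
}

print(replay(EPOCHS, {}))           # 4 epochs: no prefetching
print(replay(EPOCHS, traditional))  # [['A', 'B'], ['E'], ['H', 'I']]
print(replay(EPOCHS, ebcp))         # [['A', 'B'], ['H', 'I']]
```

Under these assumptions the traditional table spends its predictions on B, which misses anyway, and leaves E uncovered, so it removes one epoch where EBCP removes two.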

EBCP Components

[Figure: block diagram showing two processor cores (each with L1-I and L1-D caches), prefetch control, a crossbar, four L2 banks, two memory controllers, and two DRAMs each holding a correlation table]

• Prefetcher control observes all L2 cache requests

• L2 banks notify prefetcher control which requests are misses

EBCP Prefetcher Control
• Request OS for memory to store correlation table
• Detect epochs
  - observe when the number of off-chip misses transitions from 0 to 1
• Learn correlations
  - record correlations in main memory correlation table
• Issue prefetches
  - use first miss address in epoch to look up correlation table
  - select miss addresses from correlation table entry
  - issue prefetches (lower priority than demand accesses)
• Return memory to OS if needed
• EBCP is very simple and requires almost zero on-chip storage!

Experimental Results

Baseline Processor Model
• Moderate out-of-order issue core
  - single thread
  - 4-wide issue
  - 64 entry issue queue, 128 entry reorder buffer
• 32KB 4-way L1 instruction and data caches
• 2MB 4-way L2 cache
  - prefetches installed into prefetch buffer
• Memory bandwidth model
  - 9.6 GB/s read bandwidth
  - 4.8 GB/s write bandwidth
  - 500 cycle unloaded memory latency
• Commercial applications benchmarks
  - OLTP, TPC-W, SPECjbb2005, SPECjAppServer2004

[Figure: % performance improvement (0% to 45%) vs. prefetch degree (2, 4, 6, 8, 12, 16, 32) for OLTP, TPC-W, SPECjbb and SPECjAppServer]

Coverage vs Accuracy

[Figure, two panels: % coverage (0% to 60%) and % accuracy (0% to 50%) vs. prefetch degree (2, 4, 6, 8, 12, 16, 32) for OLTP, TPC-W, SPECjbb and SPECjAppServer]

[Figure: % performance improvement (-5% to 45%) vs. prefetch degree (4, 8, 16, 32) at memory bandwidths of 9.6, 6.4 and 3.2 GB/s, for OLTP, TPC-W, SPECjbb and SPECjAppServer]

Correlation Table Size
• Prefetch degree 8

[Figure: % performance improvement (0% to 35%) vs. predictor table entries (64K, 128K, 256K, 512K, 1M, 2M, 4M, 8M) for OLTP, TPC-W, SPECjbb and SPECjAppServer]

• Storing table in main memory makes such large sizes practical

Comparison with Other Prefetchers
• Global History Buffer G/AC (GHB)
  - address correlation, unique table storage (small: 256KB, large: 4MB)
• Tag Correlating Prefetcher (TCP)
  - tag correlation (small: 256KB, large: 4MB)
• Stream
  - traditional stride-based stream prefetcher
• Spatial Memory Streaming (SMS)
  - spatial locality within region (128KB)
• Solihin
  - memory-side address correlation prefetcher (64MB)
• Prefetch degree = 6 for all prefetchers (except SMS)
• Prefetches brought into 64 entry prefetch buffer

Comparison with Other Prefetchers

[Figure: % performance improvement (0% to 25%) of GHB (small, large), TCP (small, large), Stream, SMS, Solihin and EBCP for OLTP, TPC-W, SPECjbb and SPECjAppServer]

• EBCP outperforms all prefetchers for all four benchmarks

Summary
• EBCP successfully overcomes drawbacks of traditional correlation prefetchers
  - stores large correlation table in main memory
  - exploits unused memory capacity and bandwidth
  - targets removal of entire epochs
  - very simple prefetcher control
  - almost zero on-chip storage
• EBCP performs very well on all four commercial benchmarks
• Future work:
  - efficient implementation for chip multi-processors
  - improved accuracy
• Epoch-based concept can be applied to other uarch techniques!

Yuan Chou

Prefetch Buffer Size
• Prefetch degree 8
• 1 million table entries

[Figure: % performance improvement (0% to 35%) vs. prefetch buffer entries (16, 32, 64, 128, 256) for OLTP, TPC-W, SPECjbb and SPECjAppServer]

• 64 entries sufficient for all four benchmarks