CRAMES: Compressed RAM for Embedded Systems

Robert Dick http://robertdick.org/ or http://ziyang.eecs.northwestern.edu/∼dickrp/ Department of Electrical Engineering and Northwestern University

Application Application Application

Swap

Page Page Decomp Comp

CRAMES

Virtual Memory Introduction CRAMES design Experimental evaluation Collaborators on project

NEC

Northwestern University NEC Labs America Lei Yang Dr. Haris Lekatsas Lan S. Bai Dr. Srimat Chakradhar Robert P. Dick

2 Robert Dick CRAMES Introduction Motivation CRAMES design Past work Experimental evaluation Outline

1. Introduction Motivation Past work

2. CRAMES design Design overview Pattern-based partial match compression

3. Experimental evaluation Experimental setup Commercialization and future directions

3 Robert Dick CRAMES Introduction Motivation CRAMES design Past work Experimental evaluation Problem background

RAM quantity limits application functionality

RAM price dropping but usage growing faster Secure Internet access, email, music, and games

How much RAM? Functionality Cost Power consumption Size

4 Robert Dick CRAMES Introduction Motivation CRAMES design Past work Experimental evaluation Ideal hardware–software design process

Ideal case Hardware and software engineers collaborate on system-level design from start to finish

We teach the advantages of this in our classes It doesn’t always happen

5 Robert Dick CRAMES HW engineers SW engineers

Introduction Motivation CRAMES design Past work Experimental evaluation Real hardware–software design process

6 Robert Dick CRAMES Introduction Motivation CRAMES design Past work Experimental evaluation Real hardware–software design process

HW engineers SW engineers

6 Robert Dick CRAMES Introduction Motivation CRAMES design Past work Experimental evaluation Options

Option 1: Add more memory Implications: Hardware redesign, miss shipping target, get fired

Option 2: Rip out memory-hungry application features Implications: Lose market to competitors, fail to recoup design and production costs, get fired

Option 3: Make it seem as if memory increased without changing hardware, without changing applications, and without performance or power consumption penalties Nobody knew how to do this

7 Robert Dick CRAMES Introduction Motivation CRAMES design Past work Experimental evaluation Goals

Allow application RAM requirements to overrun initial estimates even after hardware design Reduce physical RAM, negligible performance and energy cost Improve functionality or performance with same physical RAM

8 Robert Dick CRAMES Introduction Motivation CRAMES design Past work Experimental evaluation HW RAM compression

Tremaine 2001, Benini 2002, Moore 2003 Hardware (de)compression unit between cache and RAM Hardware redesign and application-specific compression hardware Past work claimed application-specific hardware essential to keep power and performance overhead low

9 Robert Dick CRAMES Introduction Motivation CRAMES design Past work Experimental evaluation Code compression

Lekatsas 2000, Xu 2004 Store code compressed, decompress during execution Compress off-line, decompress on-line For RAM, less important than on-line

10 Robert Dick CRAMES Introduction Motivation CRAMES design Past work Experimental evaluation Compression for swap performance

Compressed caching Douglis 1993, Russinovich 1996, Wilson 1999, Kjelso 1999 Add compressed software cache to VM

Swap compression RamDoubler, Cortez 2000, Roy 2001, Chihaia 2005 Compress swapped-out pages and store them in software cache

Both techniques Target: general purpose system with disks Goal: improve system performance Interface to backing store (disk)

11 Robert Dick CRAMES Introduction Design overview CRAMES design Pattern-based partial match compression Experimental evaluation Outline

1. Introduction Motivation Past work

2. CRAMES design Design overview Pattern-based partial match compression

3. Experimental evaluation Experimental setup Commercialization and future directions

12 Robert Dick CRAMES Introduction Design overview CRAMES design Pattern-based partial match compression Experimental evaluation Design principles

Page selection Scheduling compression and decompression Organizing compressed and uncompressed regions Dynamically adjust compressed region size

Compression scheme High performance Energy efficient Good compression ratio Low memory requirement

13 Robert Dick CRAMES Introduction Design overview CRAMES design Pattern-based partial match compression Experimental evaluation CRAMES design

Page Fault Memory Low

8 1 1 8

Kernel Swapping System

7 2 2 7

CRAMES Device Driver

6 6 3 3 4 Block Memory Block Decompressor Manager Compressor

5 5 4

Virtual Memory

MMU

RAM

14 Robert Dick CRAMES Introduction Design overview CRAMES design Pattern-based partial match compression Experimental evaluation System configuration

Slightly slower Ramdisk with Original CRAMES file system 20 MB

Ramdisk Compressed RAM with file system device with file system Main memory 12 MB 12 MB working area 10 MB

Main memory working area Slightly slower Main memory 10 MB main memory working area working area 20 MB Compressed 20 MB swap device 10 MB

System/User view System view User view

15 Robert Dick CRAMES Compression Ratio Working Memory

7600kB 0.39 10000 3700kB

0.34 bzip2 1000 zlib default 256KB 0.29 zilb level 1 100 Compression (bytes) 64KB

zlib level 9 2 44KB 0.24 Decompression RLE log 16KB 16KB LZRW-1A 10 0.19 Compression Ratio Compression LZO 00 0 0.14 1 512 1024 2048 4096 8192 LZO bzip2 zlib default LZO LZRW-1A RLE Block Size (bytes) best existing Compression compression Compression Time for Decompression Time 0.0025 0.0012 CRAMES 0.001 bzip2 0.002 bzip2 zlib default zilb default 0.0008 zlib level 1 0.0015 zlib level 1 zlib level 9 zlib level 9 0.0006 RLE 0.001 RLE Time (sec) Time Time (sec) Time 0.0004 LZRW-1A LZRW-1A 0.0005 LZO LZO 0.0002

0 0 512 1024 2048 4096 8192 512 1024 2048 4096 8192 Block Size (bytes) Block Size (bytes)

Introduction Design overview CRAMES design Pattern-based partial match compression Experimental evaluation Compression algorithm

Compression Ratio Working Memory

7600kB 0.39 10000 3700kB

0.34 bzip2 1000 zlib default 256KB 0.29 zilb level 1 100 Compression (bytes) 64KB

zlib level 9 2 44KB 0.24 Decompression RLE log 16KB 16KB LZRW-1A 10 0.19 Compression Ratio Compression LZO 00 0 0.14 1 512 1024 2048 4096 8192 bzip2 zlib default LZO LZRW-1A RLE Block Size (bytes) Compression Algorithms

Compression Time Decompression Time

0.0025 0.0012

0.001 bzip2 0.002 bzip2 zlib default zilb default 0.0008 zlib level 1 0.0015 zlib level 1 zlib level 9 zlib level 9 0.0006 RLE 0.001 RLE Time (sec) Time Time (sec) Time 0.0004 LZRW-1A LZRW-1A 0.0005 LZO LZO 0.0002

0 0 512 1024 2048 4096 8192 512 1024 2048 4096 8192 Block Size (bytes) Block Size (bytes)

16 Robert Dick CRAMES Compression Ratio Working Memory

7600kB 0.39 10000 3700kB

0.34 bzip2 1000 zlib default 256KB 0.29 zilb level 1 100 Compression (bytes) 64KB

zlib level 9 2 44KB 0.24 Decompression RLE log 16KB 16KB LZRW-1A 10 0.19 Compression Ratio Compression LZO 00 0 0.14 1 512 1024 2048 4096 8192 bzip2 zlib default LZO LZRW-1A RLE Block Size (bytes) Compression Algorithms

Compression Time Decompression Time

0.0025 0.0012

0.001 bzip2 0.002 bzip2 zlib default zilb default 0.0008 zlib level 1 0.0015 zlib level 1 zlib level 9 zlib level 9 0.0006 RLE 0.001 RLE Time (sec) Time Time (sec) Time 0.0004 LZRW-1A LZRW-1A 0.0005 LZO LZO 0.0002

0 0 512 1024 2048 4096 8192 512 1024 2048 4096 8192 Block Size (bytes) Block Size (bytes)

Introduction Design overview CRAMES design Pattern-based partial match compression Experimental evaluation Compression algorithm

Compression Ratio Working Memory

7600kB 0.39 10000 3700kB

0.34 bzip2 1000 zlib default 256KB 0.29 zilb level 1 100 Compression (bytes) 64KB

zlib level 9 2 44KB 0.24 Decompression RLE log 16KB 16KB LZRW-1A 10 0.19 Compression Ratio Compression LZO 00 0 0.14 1 512 1024 2048 4096 8192 LZO bzip2 zlib default LZO LZRW-1A RLE Block Size (bytes) best existing Compression Algorithms compression Compression Time algorithm for Decompression Time 0.0025 0.0012 CRAMES 0.001 bzip2 0.002 bzip2 zlib default zilb default 0.0008 zlib level 1 0.0015 zlib level 1 zlib level 9 zlib level 9 0.0006 RLE 0.001 RLE Time (sec) Time Time (sec) Time 0.0004 LZRW-1A LZRW-1A 0.0005 LZO LZO 0.0002

0 0 512 1024 2048 4096 8192 512 1024 2048 4096 8192 Block Size (bytes) Block Size (bytes)

16 Robert Dick CRAMES Introduction Design overview CRAMES design Pattern-based partial match compression Experimental evaluation Weak link: Compression algorithm

LZO average performance penalty 9.5% when RAM reduced to 40% Developed simulation environment to permit profiling Compression and decompression were taking most time Needed a better compression algorithm

17 Robert Dick CRAMES Introduction Design overview CRAMES design Pattern-based partial match compression Experimental evaluation Weak link: Compression algorithm

LZO average performance penalty 9.5% when RAM reduced to 40% Developed simulation environment to permit profiling Compression and decompression were taking most time Needed a better compression algorithm

17 Robert Dick CRAMES Introduction Design overview CRAMES design Pattern-based partial match compression Experimental evaluation Pattern-based partial dictionary match coding

Input 32-bit word

N Y Output dictionary Is something similar Output value location and in dictionary? difference

Eliminate LRU dictionary entry Update dictionary entry

Place entry in dictionary Move entry to front of dictionary

18 Robert Dick CRAMES 100% xmxx xzxz 90% zxxz 80% Nearby values similar xzzx zxzz 70% (data, pointers): 27% zzxz xzzz 60% Small values: 10% xxzz xxzx 50% xzxx xxxz 40%

Percentage % Percentage zxxx 30% zzxx zxzx 20% Zeros: 38% mmmx mmxx 10% mmmm zzzx 0% xxxx ) ) ) 2 ,1) 4 ,1) ,4 ,2) 1) 2) (8, (4,4) (8, 6, zzzz (16,1) (32 (16,2) (64 (32,2) (16 (64 (32,4) 28, (64,4) (128,1) (25 (1 Dictionary layout (size, level)

Introduction Design overview CRAMES design Pattern-based partial match compression Experimental evaluation Data regularity

100% xmxx xzxz 90% zxxz

80% xzzx zxzz 70% zzxz xzzz 60% xxzz xxzx 50% xzxx xxxz 40%

Percentage % Percentage zxxx 30% zzxx zxzx 20% mmmx mmxx 10% mmmm zzzx 0% xxxx ) ) ) 2 ,1) 4 ,1) ,4 ,2) 1) 2) (8, (4,4) (8, 6, zzzz (16,1) (32 (16,2) (64 (32,2) (16 (64 (32,4) 28, (64,4) (128,1) (25 (1 Dictionary layout (size, level)

19 Robert Dick CRAMES 100% xmxx xzxz 90% zxxz

80% xzzx zxzz 70% zzxz xzzz 60% xxzz xxzx 50% xzxx xxxz 40%

Percentage % Percentage zxxx 30% zzxx zxzx 20% mmmx mmxx 10% mmmm zzzx 0% xxxx ) ) ) 2 ,1) 4 ,1) ,4 ,2) 1) 2) (8, (4,4) (8, 6, zzzz (16,1) (32 (16,2) (64 (32,2) (16 (64 (32,4) 28, (64,4) (128,1) (25 (1 Dictionary layout (size, level)

Introduction Design overview CRAMES design Pattern-based partial match compression Experimental evaluation Data regularity

100% xmxx xzxz 90% zxxz 80% Nearby values similar xzzx zxzz 70% (data, pointers): 27% zzxz xzzz 60% Small values: 10% xxzz xxzx 50% xzxx xxxz 40%

Percentage % Percentage zxxx 30% zzxx zxzx 20% Zeros: 38% mmmx mmxx 10% mmmm zzzx 0% xxxx ) ) ) 2 ,1) 4 ,1) ,4 ,2) 1) 2) (8, (4,4) (8, 6, zzzz (16,1) (32 (16,2) (64 (32,2) (16 (64 (32,4) 28, (64,4) (128,1) (25 (1 Dictionary layout (size, level)

19 Robert Dick CRAMES Introduction Design overview CRAMES design Pattern-based partial match compression Experimental evaluation Pattern-based partial match

Algorithm Consider each 32-bit word as an input Allow partial dictionary match Use most frequent patterns based on statistical analysis

Optimizations Two-way associative LRU 16-entry hash-mapped dictionary Optimized coding scheme Early termination for uncompressable data Fine-grained operation parallelization

20 Robert Dick CRAMES Introduction Design overview CRAMES design Pattern-based partial match compression Experimental evaluation PBPM evaluations

PBPM twice as fast as LZO Compression ratio degrades by 10% Improves performance and still doubles usable memory

21 Robert Dick CRAMES Introduction Design overview CRAMES design Pattern-based partial match compression Experimental evaluation CRAMES and in-RAM filesystem compression

Compressed Kernel RAM disk 7% reserved 37.5% 2.2 MB 12 MB

55.6% 17.8 MB Main memory

22 Robert Dick CRAMES Introduction Experimental setup CRAMES design Commercialization and future directions Experimental evaluation Outline

1. Introduction Motivation Past work

2. CRAMES design Design overview Pattern-based partial match compression

3. Experimental evaluation Experimental setup Commercialization and future directions

23 Robert Dick CRAMES Introduction Experimental setup CRAMES design Commercialization and future directions Experimental evaluation Experimental setup

Sharp Zaurus SL-5600 PDA Intel XScale PXA250 32 MB flash memory 32 MB RAM Embedix (Linux 2.4.18 kernel) Qt/Qtopia PDA edition

24 Robert Dick CRAMES Introduction Experimental setup CRAMES design Commercialization and future directions Experimental evaluation CRAMES for compressed filesystems

CRAMES was used to create a compressed RAM device on Zaurus SL-5600 EXT2 file system LZO compression Resource map memory allocation Benchmarks: common file system operations Increase capacity to 1.6× Execution time increases by 8.4% Energy consumption increases by 5.2%

25 Robert Dick CRAMES Introduction Experimental setup CRAMES design Commercialization and future directions Experimental evaluation Benchmarks

Impact of using CRAMES to reduce physical RAM Results for ADPCM: Speech compression application from MediaBench JPEG: Image encoding application from MediaBench MPEG2: application from MediaBench Straight-forward matrix multiplication Intentionally difficult for CRAMES Also tested on 10 GUI applications that came with Qtopia Next-generation cellphone prototype

26 Robert Dick CRAMES Introduction Experimental setup CRAMES design Commercialization and future directions Experimental evaluation Results

Reduced RAM from 20 MB to 8 MB Base case: 20 MB RAM, no compression

Without CRAMES All suffered significant performance penalties Matrix multiplication cannot execute

With CRAMES LZO average case 9% overhead, worst case 29% PBPM average case 2.5% overhead, worst case 9%

Also works on arbitrary in-RAM filesystems

27 Robert Dick CRAMES Introduction Experimental setup CRAMES design Commercialization and future directions Experimental evaluation Status

Patent pending: Northwestern University and NEC Millions of cellphones using CRAMES started shipping in June 2007 Technology won Computerwold Horizon Award in 2007

28 Robert Dick CRAMES Introduction Experimental setup CRAMES design Commercialization and future directions Experimental evaluation What did this require?

Bunny suit army?

29 Robert Dick CRAMES Introduction Experimental setup CRAMES design Commercialization and future directions Experimental evaluation What did this require?

Bunny suit army? No! One good Ph.D. student.

30 Robert Dick CRAMES Introduction Experimental setup CRAMES design Commercialization and future directions Experimental evaluation Another direction Memory expansion for MMU-less embedded systems

Observations and Results user application Application: Sensor networks LLVM compiler Implemented in LLVM, Instruction byte code replacement, tested on TelosB nodes MEMMU compiler compiler-time optimizations Increases usable memory by Calls to 50%, unchanged applications MEMMU modified byte code library Little overhead after MEMMU LLVM compiler runtime compiler optimizations library executable CASES’06

With Lan Bai and Lei Yang

31 Robert Dick CRAMES Introduction Experimental setup CRAMES design Commercialization and future directions Experimental evaluation Acknowledgments

This work was supported by NEC Labs America and NSF award CCR-0347941. We would like to acknowledge Hui Ding of Northwestern University for his suggestions on efficiently implementing PBPM and his help testing CRAMES with GUI applications. Professors Peter Dinda, Gokhan Memik, and Lawrence Henschen provided suggestions and feedback.

32 Robert Dick CRAMES Introduction Experimental setup CRAMES design Commercialization and future directions Experimental evaluation Thank you for attending!

For additional information about this work CRAMES and MEMMU at http://ziyang.eecs.northwestern.edu/∼dickrp/tools.html

For additional information about the team that created it http://ziyang.eecs.northwestern.edu/∼dickrp/people/students.html

33 Robert Dick CRAMES