
VXA: A Virtual Architecture for Durable Compressed Archives

Bryan Ford
Computer Science and Artificial Intelligence Laboratory
Massachusetts Institute of Technology

http://pdos.csail.mit.edu/~baford/vxa/

The Ubiquity of

Everything is compressed these days
– Archive/Backup/Distribution: .gz, ...
– Multimedia streams: wmv, ...
– Office documents: XML-in-ZIP
– Digital cameras: JPEG, proprietary RAW, ...
– Video camcorders: DV, MPEG-2, ...

Compressed Data Formats

Observation #1: Data compression formats evolve rapidly

[Figure: timeline, 1980 to 2005, of compressed data formats in four categories: Lossless Compression, Image Encoding, Video Encoding, Audio Encoding]

Problems:
– Inconvenient: each new algorithm requires decoder install/upgrade
– Impedes data portability: data unusable on systems without supported decoder
– Threatens long-term data usability: old decoders may not run on new operating systems

Archiving Compressed Data

Observation #2: Processor architectures evolve more conservatively

[Figure: timeline, 1980 to 2005, of the x86 Architecture (fully backward compatible extensions) and Other Architectures]

VXA: Virtual Executable Archives

Observation 1+2: Instruction formats are historically more durable than compressed data formats

Make archive self-extracting (data + executable decoder)
To extract data, archive reader runs embedded decoder

[Figure: an Archive Writer with an Encoder and a Decoder writes the decoder (D) into the Archive; the Archive Reader runs D]

Goals of VXA

Make self-extracting archives...
1. Safe: malicious decoders can't compromise host
2. Future-proof: simple, well-defined architecture [Lorie]
3. Easy: allow reuse of existing code, languages, tools
4. Efficient: practical for short-term data packaging too

[Figure: an Archive Writer with an Encoder and a Decoder writes the decoder (D) into the Archive; the Archive Reader runs D in a Fast x86 Emulator]

Outline

● Archiver Operation

● vxZIP Archive Format

● Decoder Architecture

● Emulator Design & Implementation

● Evaluation (performance, storage overhead)

● Conclusion

Archive Writer Operation

[Figure: the VXA Archiver passes Uncompressed Input Files through a General, Image, or Audio Compressor and Pre-Compressed Input Files through an Image Format or Audio Format Recognizer; each is paired with a decoder (Decoder1 to Decoder5), stored in the Archive as D1 to D5]
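The writer's choice between compressing and recognizing can be sketched roughly as follows. This is an illustration only, not the VXA archiver's code: `plan_entry`, `build_archive`, and the `RECOGNIZERS` table are hypothetical names, recognition is reduced to well-known magic numbers (JPEG files begin with FF D8 FF, FLAC streams with "fLaC"), and Python's `zlib` stands in for the general compressor.

```python
import zlib

# Hypothetical recognizer table keyed by decoder name.
# JPEG files start with FF D8 FF; FLAC streams start with "fLaC".
RECOGNIZERS = {
    "jpeg": lambda data: data.startswith(b"\xff\xd8\xff"),
    "flac": lambda data: data.startswith(b"fLaC"),
}

def plan_entry(data: bytes):
    """Return (decoder_name, stored_bytes) for one input file."""
    for decoder_name, matches in RECOGNIZERS.items():
        if matches(data):
            # Already compressed: store unchanged and attach its decoder.
            return decoder_name, data
    # Not recognized: treat as uncompressed, use the general compressor.
    return "zlib", zlib.compress(data)

def build_archive(files: dict):
    """Return (entries, decoders); each decoder name is recorded once."""
    entries, decoders = {}, set()
    for name, data in files.items():
        decoder_name, stored = plan_entry(data)
        entries[name] = (decoder_name, stored)
        decoders.add(decoder_name)
    return entries, decoders

entries, decoders = build_archive({
    "a.txt": b"plain text " * 50,
    "b.txt": b"more text " * 50,
    "c.flac": b"fLaC" + bytes(16),  # fake FLAC payload, magic number only
})
```

Both text files share the single "zlib" decoder entry, while the pre-compressed file is stored byte for byte.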

Archive Reader Operation

[Figure: the VXA Archive Reader runs the archived decoders D1 to D5 (Decoder1 to Decoder5) in its x86 Emulator to produce the Original Uncompressed Files; pre-compressed files can be extracted either in their original compressed form or as de-compressed files]

vxZIP Archive Format

● Backward compatible with legacy ZIP format

● Decoders intermixed with archived files

● Archived files have new extension header pointing to decoder

● Decoders are hidden, “deflated”

[Figure: vxZIP Archive layout, in order: JP2 Decoder (deflated), Image file (JP2-encoded), FLAC Decoder (deflated), Audio file (FLAC-encoded), Audio file (FLAC-encoded), Central Directory]
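As a rough illustration of the extension-header idea, the sketch below attaches a ZIP extra-field record to a file entry using Python's standard `zipfile` module and reads it back. The record layout (2-byte ID, 2-byte size, payload) is the generic ZIP extra-field framing; the ID `0x7856`, the 4-byte offset payload, and the helper names are assumptions made for this example, not the actual vxZIP header definition.

```python
import io
import struct
import zipfile

# Hypothetical header ID; the real vxZIP value and layout are not given here.
VXA_HEADER_ID = 0x7856

def make_vxa_extra(decoder_offset: int) -> bytes:
    """Build one extra-field record: ID (2 bytes), size (2 bytes), payload."""
    payload = struct.pack("<I", decoder_offset)
    return struct.pack("<HH", VXA_HEADER_ID, len(payload)) + payload

def find_vxa_decoder_offset(extra: bytes):
    """Walk the extra-field records; return the decoder offset or None."""
    pos = 0
    while pos + 4 <= len(extra):
        header_id, size = struct.unpack_from("<HH", extra, pos)
        body = extra[pos + 4 : pos + 4 + size]
        if header_id == VXA_HEADER_ID and len(body) == 4:
            return struct.unpack("<I", body)[0]
        pos += 4 + size  # skip records that belong to other tools
    return None

# Write an entry that carries the extra header, then read it back.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    info = zipfile.ZipInfo("song.flac")
    info.extra = make_vxa_extra(decoder_offset=1234)
    zf.writestr(info, b"placeholder audio bytes")

with zipfile.ZipFile(buf) as zf:
    offset = find_vxa_decoder_offset(zf.getinfo("song.flac").extra)
```

Readers that do not know the header ID skip the record, which is how an extra field can coexist with legacy ZIP tools.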

vxZIP Decoder Architecture

● Decoders are ELF executables for x86-32
– Can be written in any language, safe or unsafe
– Compiled using ordinary tools (GCC)

● Decoders have access to five “system calls”:
– read stdin, write stdout, malloc, next file, exit

● Decoders cannot:
– open files, windows, devices, network connections, ...
– get system info: user name, current time, OS type, ...
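The narrow interface means a decoder behaves like a pure stream filter. The toy sketch below mirrors only the shape of that interface: the host passes the decoder a `read` and a `write` callable and nothing else. The run-length format and the names `rle_decoder` and `run_decoder` are invented for the example, and plain Python provides no real isolation, unlike the emulator described in these slides.

```python
import io

def rle_decoder(read, write):
    """Toy decoder: input is (count, byte) pairs; output is the expansion."""
    while True:
        pair = read(2)
        if len(pair) < 2:  # end of input stream
            break
        count, value = pair[0], pair[1]
        write(bytes([value]) * count)

def run_decoder(decoder, encoded: bytes) -> bytes:
    """Host side: give the decoder an input and an output stream only."""
    stdin, stdout = io.BytesIO(encoded), io.BytesIO()
    decoder(stdin.read, stdout.write)
    return stdout.getvalue()

decoded = run_decoder(rle_decoder, bytes([3, ord("a"), 2, ord("b")]))
```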

Decoders Ported So Far
(using existing implementations in C, mostly unmodified)

General-purpose (lossless):

● zlib: Classic gzip/deflate algorithm

● bzip2: Burrows-Wheeler algorithm

Still image codecs:

● jpeg: Classic lossy image compression scheme

● jp2: JPEG 2000-based algorithm, lossy or lossless

Audio codecs:

● flac: Free Lossless Audio Codec

● vorbis: Standard lossy audio codec for Ogg streams

vx32 Emulator Architecture

Runs in vxUnZIP process

● Loads decoder into address space sandbox

● Restricts decoder's memory accesses to sandbox

● Dispatches decoder's VXA “system calls” to vxUnZIP (not to host OS!)

[Figure: the vxUnZIP Process Address Space holds the VXA Decoder Address Space (up to 1GB), the vx32 Emulator library, and the vxUnZIP Application; VXA System Calls pass from the decoder to vxUnZIP]
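Conceptually, restricting memory accesses to the sandbox means every decoder load and store is checked against the sandbox bounds. The sketch below shows that check in software; the `Sandbox` class and `SandboxFault` exception are hypothetical names, and the example says nothing about how vx32 itself enforces the limit.

```python
class SandboxFault(Exception):
    """Raised when a decoder access falls outside its sandbox."""

class Sandbox:
    def __init__(self, size: int):
        self.memory = bytearray(size)

    def _check(self, addr: int, length: int) -> None:
        # Reject any access that starts or ends outside [0, size).
        if addr < 0 or length < 0 or addr + length > len(self.memory):
            raise SandboxFault(f"access at {addr} (+{length}) outside sandbox")

    def load(self, addr: int, length: int) -> bytes:
        self._check(addr, length)
        return bytes(self.memory[addr : addr + length])

    def store(self, addr: int, data: bytes) -> None:
        self._check(addr, len(data))
        self.memory[addr : addr + len(data)] = data

# The slides allow up to 1GB per decoder; 4 KiB keeps the demo small.
box = Sandbox(size=4096)
box.store(100, b"hello")
```

An access such as `box.store(4094, b"abc")` crosses the end of the sandbox and raises `SandboxFault` instead of touching host memory.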

vx32 Emulator Implementation

On x86-{32/64} hosts:
– Secure fault isolation [Wahbe]
– Data sandboxing via custom LDT segments
– Code sandboxing via instruction rewriting [Sites, Nethercote]
– No privileges or kernel extensions

[Figure: below the Kernel Address Space, the Decoder Data Segment (LDT) covers the VXA Decoder Address Space (up to 1GB); Code Rewriting fills a Transformed code cache; the Flat-Model Code/Data Segment holds the vx32 Emulator library and the vxUnZIP Application, reached through Controlled Procedure Calls]

vx32 Emulator Implementation

On other host architectures:
– Portable but slow “fallback” instruction interpreter (mostly done)
– Fast x86-to-PowerPC binary translator (in progress)
– Hopefully more in the future

Emulator implemented as generic library
– Can be used for other sandboxing applications

Evaluation

Two issues to address:

● Performance overhead of emulated decoders
– not important for long-term archival storage, but...
– very important for common short-term uses of archives: backups, software distribution, structured documents, ...

● Storage overhead of archived decoders

Performance Test Method

Run 6 ported decoders on appropriate data sets
– Athlon 64 3000+ PC running SuSE 9.3
– Measure user-mode CPU time (not wall-clock time)

Compare:
– Emulated vs native execution
– Running on x86-32 vs x86-64 host environment
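A minimal sketch of measuring user-mode CPU time rather than wall-clock time, using `os.times()` from the Python standard library. It times an in-process `zlib.decompress` call as a stand-in for a decoder run; the helper names are hypothetical, and dividing by the native time is an assumed normalization, not a description of the actual test harness.

```python
import os
import zlib

def user_cpu_seconds(func, *args):
    """Return (result, user-mode CPU seconds this process spent in func)."""
    before = os.times().user
    result = func(*args)
    after = os.times().user
    return result, after - before

def normalized(emulated_seconds: float, native_seconds: float) -> float:
    """Emulated time relative to native time (1.0 means no overhead)."""
    return emulated_seconds / native_seconds

# Stand-in workload: decompress a zlib stream inside this process.
original = b"example data " * 100_000
payload = zlib.compress(original)
data, seconds = user_cpu_seconds(zlib.decompress, payload)
```

User-mode CPU time excludes time spent waiting on I/O or in the kernel, so it isolates the cost of the decoding computation itself.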

Performance Overhead

[Figure: bar chart of normalized user-mode execution time (y-axis, 0 to 1.2) for zlib, bzip2, jpeg, jp2, flac, and vorbis; series: native x86-32, vx32 on x86-32, native x86-64, vx32 on x86-64]

Storage Overhead

Archiver stores only one copy of each decoder
– Storage cost amortized over all files of same type
– Relative overhead depends on size of archive

Therefore, measure only absolute decoder size (compressed, as stored in archive)
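The amortization argument can be made concrete with a small calculation. The sizes below are hypothetical placeholders, not measurements from this work, and `decoder_overhead` is a name chosen for the example.

```python
def decoder_overhead(decoder_bytes: int, file_bytes: int, n_files: int) -> float:
    """Fraction of the archive occupied by the single stored decoder."""
    total = decoder_bytes + n_files * file_bytes
    return decoder_bytes / total

# Hypothetical sizes: a 50 KB decoder and 1 MB per archived file.
DECODER_BYTES = 50 * 1024
FILE_BYTES = 1024 * 1024

# One decoder copy is shared, so its share shrinks as files are added.
overheads = [decoder_overhead(DECODER_BYTES, FILE_BYTES, n) for n in (1, 10, 100)]
```

Because the relative figure depends entirely on how much data shares the decoder, the absolute decoder size is the more meaningful number to report.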

Storage Overhead

[Figure: bar chart of compressed code size in KBytes (y-axis, 0 to 130) for zlib, bzip2, jpeg, jp2, flac, and vorbis; each bar is split into Decoder and C library]

Conclusion

VXA makes self-extracting archives...

● Safe: decoders fully sandboxed

● Future-proof: simple, OS-independent environment

● Easy: re-use existing decoders, languages, tools

● Efficient: ≤ 11% slowdown vs native x86-32

Available at: http://pdos.csail.mit.edu/~baford/vxa/