VXA: A Virtual Architecture for Durable Compressed Archives
Bryan Ford Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology
http://pdos.csail.mit.edu/~baford/vxa/
The Ubiquity of Data Compression
Everything is compressed these days – Archive/Backup/Distribution: ZIP, tar.gz, ... – Multimedia streams: mp3, ogg, wmv, ... – Office documents: XMLinZIP – Digital cameras: JPEG, proprietary RAW, ... – Video camcorders: DV, MPEG2, ...
Compressed Data Formats
Observation #1: Data compression formats evolve rapidly
s es pr rc 2 C O a p R p Q U om R O H IP zi A zi z S L c A Z L Z g R b 7 — — — — — — — — — — — Lossless Compression
1980 1985 1990 1995 2000 2005
Compressed Data Formats
Observation #1: Data compression formats evolve rapidly
s es pr rc 2 C O a p R p Q U om R O H IP zi A zi z S L c A Z L Z g R b 7 — — — — — — — — — — — Lossless Compression 0 00 2 P M F G G B F F X A E G E M L I I C G P N P B I T G P T J P J — — — — — — — — — Image Encoding e im 1 2 4 T son 7 8 9 M k G G G n V V V I IC ic E E E re N L u P P V P o M M A F Q M M D M S W WM W — — — — — — — — — — — Video Encoding io C ud 7 is 9 X F F V 3 lA A C b A V F F A P a C A r S I I e A M L o M 8 A A W M R A W F V W — — — — — — — — — — — Audio Encoding
1980 1985 1990 1995 2000 2005 Compressed Data Formats
Observation #1: Data compression formats evolve rapidly Problems: – Inconvenient: each new algorithm requires decoder install/upgrade – Impedes data portability: data unusable on systems without supported decoder – Threatens longterm data usability: old decoders may not run on new operating systems
Archiving Compressed Data
Observation #2: Processor architectures evolve more conservatively
r) r) o to ) ct c 4 ) 6 it X e e 6 it 6 8 b v E 6 b 3 M t S v 8 08 0 2 n S P x 4 8 8 (3 M (i (F (6 — — — — — x86 Architecture
1980 1985 1990 1995 2000 2005
Archiving Compressed Data
Observation #2: Processor architectures evolve more conservatively
Fully Backward Compatible Extensions
r) r) o to ) ct c 4 ) 6 it X e e 6 it 6 8 b v E 6 b 3 M t S v 8 08 0 2 n S P x 4 8 8 (3 M (i (F (6 — — — — — x86 Architecture
1980 1985 1990 1995 2000 2005
Archiving Compressed Data
Observation #2: Processor architectures evolve more conservatively
r) r) o to ) ct c 4 ) 6 it X e e 6 it 6 8 b v E 6 b 3 M t S v 8 08 0 2 n S P x 4 8 8 (3 M (i (F (6 — — — — — a x86 Architecture h C C lp S P 0 C I r A m 0 S R M R e iu 0 P A w C 8 I P R A o E tan 6 M S A P P D I — — — — — — — — Other Architectures
1980 1985 1990 1995 2000 2005
Archiving Compressed Data
Observation #2: Processor architectures evolve more conservatively
r) r) o to ) ct c 4 ) 6 it X e e 6 it 6 8 b v E 6 b 3 M t S v 8 08 0 2 n S P x 4 8 8 (3 M (i (F (6 — — — — — a x86 Architecture h C C lp S P 0 C I r A m 0 S R M R e iu 0 P A w C 8 I P R A o E tan 6 M S A P P D I — — — — — — — — Other Architectures
1980 1985 1990 1995 2000 2005
Archiving Compressed Data
Observation #2: Processor architectures evolve more conservatively
r) r) o to ) ct c 4 ) 6 it X e e 6 it 6 8 b v E 6 b 3 M t S v 8 08 0 2 n S P x 4 8 8 (3 M (i (F (6 — — — — — a x86 Architecture h C C lp C S P 0 I r A m 0 S R M R e C iu 0 IP A w 8 P R A o E tan 6 M S A P P D I — — — — — — — — Other Architectures
1980 1985 1990 1995 2000 2005
Archiving Compressed Data
Observation #2: Processor architectures evolve more conservatively
r) r) o to ) ct c 4 ) 6 it X e e 6 it 6 8 b v E 6 b 3 M t S v 8 08 0 2 n S P x 4 8 8 (3 M (i (F (6 — — — — — a x86 Architecture h C C lp C S P 0 I r A m 0 S R M R e iu 0 P A w C 8 I P R A o E tan 6 M S A P P D I — — — — — — — — Other Architectures
1980 1985 1990 1995 2000 2005
Archiving Compressed Data
Observation #2: Processor architectures evolve more conservatively
r) r) o to ) ct c 4 ) 6 it X e e 6 it 6 8 b v E 6 b 3 M t S v 8 08 0 2 n S P x 4 8 8 (3 M (i (F (6 — — — — — a x86 Architecture h C C lp C S P 0 I r A m 0 S R M R e iu 0 P A w C 8 I P R A o E tan 6 M S A P P D I — — — — — — — — Other Architectures
1980 1985 1990 1995 2000 2005
ic n a t I VXA: Virtual Executable Archives
Observation 1+2: Instruction formats are historically more durable than compressed data formats
Make archive selfextracting (data + executable decoder) To extract data, archive reader runs embedded decoder
Archive Archive Archive Writer Reader
Encoder
Decoder D D
Goals of VXA
Make selfextracting archives...
Archive Archive Archive Writer Reader
Encoder
Decoder D D
Goals of VXA
Make selfextracting archives... 1. Safe: malicious decoders can't compromise host 2. Futureproof: simple, welldefined architecture [Lorie]
Archive Archive Archive Writer Reader
Encoder Emulator Decoder D D
Goals of VXA
Make selfextracting archives... 1. Safe: malicious decoders can't compromise host 2. Futureproof: simple, welldefined architecture [Lorie] 3. Easy: allow reuse of existing code, languages, tools
Archive Archive Archive Writer Reader
Encoder x86 Emulator Decoder D D
Goals of VXA
Make selfextracting archives... 1. Safe: malicious decoders can't compromise host 2. Futureproof: simple, welldefined architecture [Lorie] 3. Easy: allow reuse of existing code, languages, tools 4. Efficient: practical for short term data packaging too
Archive Archive Archive Writer Reader
Encoder Fast x86 Emulator Decoder D D
Outline
● Archiver Operation
● vxZIP Archive Format
● Decoder Architecture
● Emulator Design & Implementation
● Evaluation (performance, storage overhead)
● Conclusion
Archive Writer Operation
VXA Archiver
Archive
Archive Writer Operation
Uncompressed Input Files
VXA Archiver General Compressor
Decoder1
D1
Archive
Archive Writer Operation
Uncompressed Input Files
VXA Archiver General Compressor
Decoder1
D1
Archive
Archive Writer Operation
Uncompressed Input Files
General Image Audio Compressor Compressor Compressor
Decoder1 Decoder2 Decoder3
D1 D2 D3
Archive
Archive Writer Operation
Uncompressed Input Files PreCompressed Input Files
General Image Audio Compressor Compressor Compressor
Decoder1 Decoder2 Decoder3
D1 D2 D3
Archive
Archive Writer Operation
Uncompressed Input Files PreCompressed Input Files
General Image Audio Image Format Audio Format Compressor Compressor Compressor Recognizer Recognizer
Decoder1 Decoder2 Decoder3 Decoder4 Decoder5
D1 D2 D3 D4 D5
Archive
Archive Reader Operation
VXA Archive Reader x86 Emulator
D1 D2 D3 D4 D5
Archive
Archive Reader Operation
Original Uncompressed Files
VXA Archive Reader x86 Emulator
Decoder1
D1 D2 D3 D4 D5
Archive
Archive Reader Operation
Original Uncompressed Files
VXA Archive Reader x86 Emulator
Decoder1 Decoder2 Decoder3
D1 D2 D3 D4 D5
Archive
Archive Reader Operation
Original Uncompressed Files Original PreCompressed Files
VXA Archive Reader x86 Emulator
Decoder1 Decoder2 Decoder3
D1 D2 D3 D4 D5
Archive
Archive Reader Operation
Original Uncompressed Files Decompressed Files
VXA Archive Reader x86 Emulator
Decoder1 Decoder2 Decoder3 Decoder4 Decoder5
D1 D2 D3 D4 D5
Archive
vxZIP Archive Format
● Backward compatible
with legacy ZIP format Image file
Audio file
Audio file
Central Directory
vxZIP Archive
vxZIP Archive Format
● Backward compatible JP2 Decoder
with legacy ZIP format Image file
● Decoders intermixed FLAC Decoder with archived files Audio file
Audio file
Central Directory
vxZIP Archive
vxZIP Archive Format
● Backward compatible JP2 Decoder
with legacy ZIP format Image file (JP2-encoded) ● Decoders intermixed FLAC Decoder with archived files Audio file ● Archived files have (FLAC-encoded) Audio file new extension header (FLAC-encoded) pointing to decoder Central Directory
vxZIP Archive
vxZIP Archive Format
● Backward compatible JP2 Decoder (deflated) with legacy ZIP format Image file (JP2-encoded) ● Decoders intermixed FLAC Decoder with archived files (deflated) Audio file ● Archived files have (FLAC-encoded) Audio file new extension header (FLAC-encoded) pointing to decoder Central Directory ● Decoders are hidden, vxZIP Archive “deflated” (gzip)
vxZIP Decoder Architecture
● Decoders are ELF executables for x8632 – Can be written in any language, safe or unsafe – Compiled using ordinary tools (GCC)
● Decoders have access to five “system calls”: – read stdin, write stdout, malloc, next file, exit
● Decoders cannot: – open files, windows, devices, network connections, ... – get system info: user name, current time, OS type, ...
Decoders Ported So Far (using existing implementations in C, mostly unmodified)
Generalpurpose (lossless):
● zlib: Classic gzip/deflate algorithm
● bzip2: BurrowsWheeler algorithm Still image codecs:
● jpeg: Classic lossy image compression scheme
● jp2: JPEG 2000 waveletbased algorithm, lossy or lossless Audio codecs:
● flac: Free Lossless Audio Codec
● vorbis: Standard lossy audio codec for Ogg streams
vx32 Emulator Architecture
Runs in vxUnZIP process
● Loads decoder into address space sandbox
● Restricts decoder's VXA Decoder Decoder vxUnZIP Address Space Address memory accesses Process (up to 1GB) Space to sandbox Address Space 0 ● vx32 Emulator library Dispatches decoder's VXA VXA “system calls” System Calls to vxUnZIP vxUnZIP (not to host OS!) Application 0
vx32 Emulator Implementation
On x86{32/64} hosts: Kernel Address Space – Secure fault isolation [Wahbe] – Data sandboxing via Decoder VXA Decoder Data custom LDT segments Code Address Space Segment Rewriting (up to 1GB) (LDT) – Code sandboxing via 0 Transformed code cache instruction rewriting Flat-Model vx32 Emulator library Controlled Code/Data Procedure [Sites, Nethercote] Segment Calls – No privileges or vxUnZIP kernel extensions Application 0
vx32 Emulator Implementation
On other host architectures: – Portable but slow “fallback” instruction interpreter (mostly done) – Fast x86toPowerPC binary translator (in progress) – Hopefully more in the future
Emulator implemented as generic library – Can be used for other sandboxing applications
Evaluation
Two issues to address:
● Performance overhead of emulated decoders – not important for longterm archival storage, but... – very important for common shortterm uses of archives: backups, software distribution, structured documents, ...
● Storage overhead of archived decoders
Performance Test Method
Run 6 ported decoders on appropriate data sets – Athlon 64 3000+ PC running SuSE Linux 9.3 – Measure usermode CPU time (not wallclock time) Compare: – Emulated vs native execution – Running on x8632 vs x8664 host environment
Performance Overhead
e 1.2 m i 1.1 T n
o 1 i t u 0.9 c e x 0.8 E
e 0.7 d o 0.6 m - r e 0.5 s U
0.4 d e
z 0.3 i l a 0.2 m r o 0.1 N 0 zlib bzip2 jpeg jp2 flac vorbis
native x86-32 vx32 on x86-32 native x86-64 vx32 on x86-64
Performance Overhead
e 1.2 m i 1.1 T n
o 1 i t u 0.9 c e x 0.8 E
e 0.7 d o 0.6 m - r e 0.5 s U
0.4 d e
z 0.3 i l a 0.2 m r o 0.1 N 0 zlib bzip2 jpeg jp2 flac vorbis
native x86-32 vx32 on x86-32 native x86-64 vx32 on x86-64
Storage Overhead
Archiver stores only one copy of each decoder – Storage cost amortized over all files of same type – Relative overhead depends on size of archive Therefore, measure only absolute decoder size (compressed, as stored in archive)
Storage Overhead
130 120 )
s 110 e t y 100 B K
( 90 e z
i 80 s
e 70 d Decoder o c 60 C library d e 50 s s
e 40 r p 30 m o
C 20 10 0 zlib bzip2 jpeg jp2 flac vorbis
Conclusion
VXA makes selfextracting archives...
● Safe: decoders fully sandboxed
● Futureproof: simple, OSindependent environment
● Easy: reuse existing decoders, languages, tools
● Efficient: ≤ 11% slowdown vs native x8632
Available at: http://pdos.csail.mit.edu/~baford/vxa/