Guidelines for Using Compare-by-hash Val Henson Richard Henderson IBM, Inc. Red Hat, Inc.
[email protected] [email protected] Abstract hash collision. Recently, a new technique called compare-by-hash Compare-by-hash made its first appearance in 1996 has become popular. Compare-by-hash is a method in rsync[26], a tool for synchronizing files over of content-based addressing in which data is identi- the network, and has since been used in a num- fied only by the cryptographic hash of its contents. ber of projects, including LBFS[14], a distributed Hash collisions are ignored, with the justification file system; Stanford’s virtual computer migration that they occur less often than many kinds of hard- project[19]; Pastiche[7], an automated backup sys- ware errors. Compare-by-hash is a powerful, ver- tem; OpenCM[21], a configuration management sys- satile tool in the software architect’s bag of tricks, tem; and CASPER[24], a block cache for distributed but it is also poorly understood and frequently mis- file systems. Venti, a block archival storage sys- used. The consequences of misuse range from signif- tem for Plan 9, is described as using compare-by- icant performance degradation to permanent, unre- hash[18] but as implemented it checks for collisions coverable data corruption or loss. The proper use of on writes[9]. Compare-by-hash is also used in a compare-by-hash is a subject of debate[10, 29], but wide variety of document retrieval systems, content- recent results in the field of cryptographic hash func- caching proxies, and many other commercial soft- tion analysis, including the breaking of MD5[28] and ware products.