Published in Linux For You, August 2008 issue.

Checksums, your best friends, for security

S. Parthasarathy [email protected]

Imagine that you write an electronic cheque for Rs. 1000, payable to your friend, and send it to him electronically. How do you, your friend, or the bank ensure that the cheque has not been tampered with or altered en route? How do you ensure the authenticity of the cheque, particularly the amount payable and the payee? The same problem appears in many other forms in your day-to-day experience. How can you be sure that the file you received as an attachment has not been altered on its way? How can you be sure that the ISO image you downloaded was not created by an impostor? Do you know why your passwords are so safe in a Linux system?

Whenever you receive any software (and that includes documentation and scripts) from any source, it is important to ensure that it carries no hidden risks or traps planted by intruders. Likewise, when you transfer or copy files over a network, you want to ensure that the file has not been modified by transmission errors. When you download software from online repositories, or receive prerecorded software (e.g. on a CD-ROM), you must consider the possibility that the source has been compromised. One of the threats users face is that intruders could include malicious code, in the form of Trojan horse programs or backdoors, in the software packages distributed by those sites. In large collections of files, intruders could slip in their own files containing malicious programs, or they could modify files which contain important material, particularly intrusion detection software or procedures. In the simplest of cases, they could replace your material with material which would embarrass you or damage your reputation. You want a simple and effective solution to protect you in all such scenarios.

The answer to all these questions lies in a simple idea called “checksum”.

A checksum is something like your fingerprint. Technically speaking, a “checksum” (also known, loosely, as a hash digest) is a form of redundancy check. It is a simple way to protect the integrity of data, by detecting errors (modifications) in data that are sent through space (telecommunications) or time (storage). A redundancy check, as the name implies, adds redundant information to the data, such that any modification or alteration of the data can be detected (in theory) by just looking at the checksum. Note that this is just a way of verifying that your data has not been modified. It does not automatically lead you to the modifications actually made. Nor does it prevent data being modified by unscrupulous agents. Nor will it tell you who modified the data, or when.

Now, this is how it works. Let us say you have a file X which you want to protect with a checksum. You choose an algorithm and use it to generate a checksum x which is derived from X. You append x to X whenever you send the file X. The receiver applies the same algorithm to his copy of X and gets his own checksum, say y. If X has not been altered, x should be the same as y. If they differ, the receiver can suspect some mischief somewhere between the time X was created and the time it was received.
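The exchange described above can be sketched in a few lines of Python. This is only an illustration: the choice of SHA-256 from the standard `hashlib` module, the function name `checksum`, and the sample messages are all assumptions made for the example, not something the scheme prescribes.

```python
import hashlib

def checksum(data: bytes) -> str:
    """Return the SHA-256 digest of data as a hex string (an assumed choice of algorithm)."""
    return hashlib.sha256(data).hexdigest()

# Sender side: derive x from X and ship both together.
X = b"Pay Rs. 1000 to my friend"
x = checksum(X)

# Receiver side: recompute y from the received copy and compare it with x.
received_ok = b"Pay Rs. 1000 to my friend"
received_bad = b"Pay Rs. 9000 to my friend"

print(checksum(received_ok) == x)   # True: the copy is unchanged
print(checksum(received_bad) == x)  # False: the copy was altered
```

The comparison only says whether the copy differs from the original. It says nothing about what was changed, or by whom.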

Let us take a simple, childish example. Let us assume that your X is the file containing this article. Let us count all the characters (including punctuation marks) in this file, and call the count x1. Then count the number of blank spaces, and call it x2. Now concatenate x2 with x1, to get x1x2. When we send X, we send x1x2 along with it. It is then a small operation for the receiver to check that the file has not been altered: he recomputes the number of characters (say y1) and the number of blank spaces (say y2) in his copy of X. He can then compare y1y2 with x1x2 and find out if the file X has been altered before reaching him.
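As a rough sketch, the childish checksum above might look like this in Python. The function name `toy_checksum` and the sample message are invented for the illustration.

```python
def toy_checksum(text: str) -> str:
    """Concatenate the character count (x1) with the blank-space count (x2)."""
    x1 = len(text)          # all characters, including punctuation and spaces
    x2 = text.count(" ")    # blank spaces only
    return f"{x1}{x2}"

message = "Pay Rs. 1000 to my friend"
x1x2 = toy_checksum(message)
print(x1x2)  # 255  (25 characters, 5 spaces)

# The receiver recomputes y1y2 on his copy and compares it with x1x2.
received = "Pay Rs. 1000 to my friend!!"
print(toy_checksum(received) == x1x2)  # False: two characters were added
```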

Of course, this approach has a lot of glaring loopholes. For instance, if the miscreant replaces one character by another, his mischief will never get caught. Or, if he changes X, recomputes the checksum, and sends the recomputed checksum, the mischief will go unnoticed. Or, if he adds one blank space at one place and removes a blank space from another place (compensating errors), he will never get caught. Do not despair. There are powerful schemes and extensions to checksums that resist such mischief.
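Two of these loopholes are easy to reproduce. The sketch below repeats the toy checksum (count of characters followed by count of spaces) and compares it with SHA-256, which is used here only as an assumed example of a stronger scheme. All names and sample strings are hypothetical.

```python
import hashlib

def toy_checksum(text: str) -> str:
    """Character count followed by blank-space count, as in the childish example."""
    return f"{len(text)}{text.count(' ')}"

def sha256_hex(text: str) -> str:
    """A cryptographic hash digest of the same text, for comparison."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

original = "Pay Rs. 1000 to my friend"
substituted = "Pay Rs. 9000 to my friend"   # one character replaced by another
compensated = "PayRs.  1000 to my friend"   # one space removed, one space added

for tampered in (substituted, compensated):
    # The toy checksum is fooled: both counts are unchanged.
    print(toy_checksum(tampered) == toy_checksum(original))  # True
    # The hash digest is not: any change yields a different digest.
    print(sha256_hex(tampered) == sha256_hex(original))      # False
```

The third loophole, where the miscreant replaces the checksum as well, is not cured by a stronger algorithm alone. It needs the checksum itself to be protected, for example by a digital signature.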

Checksums come in handy in many other ways too. For instance, suppose you have two jpeg files of the same image. Visually inspecting the two images will not show you any difference if the difference is at the pixel level (a picture may consist of several thousands of pixels). And if you have a thousand similar looking images, it is not easy to point out which ones are duplicates of each other; such a task is too cumbersome for any human. A simple way to solve this problem is to compute the checksum of each jpeg file. By comparing the checksums, you can recognise duplicates easily, because the slightest difference between two images will show up as a different checksum for the jpeg file.
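A minimal sketch of this duplicate hunt, assuming the images sit in one directory and carry a `.jpg` extension. The function names `file_digest` and `find_duplicates`, the glob pattern, and the use of SHA-256 are choices made for this example.

```python
import hashlib
from collections import defaultdict
from pathlib import Path

def file_digest(path, chunk_size=65536):
    """Hash a file in chunks, so that large images need not fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def find_duplicates(directory, pattern="*.jpg"):
    """Return groups of files whose contents have identical digests."""
    groups = defaultdict(list)
    for path in sorted(Path(directory).glob(pattern)):
        if path.is_file():
            groups[file_digest(path)].append(path)
    return [paths for paths in groups.values() if len(paths) > 1]
```

Calling `find_duplicates("/some/photo/directory")` (a hypothetical path) would return one list per set of byte-identical files. Note that this finds identical files only: two files that look alike on screen but differ in a single byte will land in different groups.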

Your own Linux machine uses a clever combination of checksums (md5/sha) and encryption (one-way encryption based on the Data Encryption Standard) to store your Linux password securely. When dealing with passwords, there are three things we are protecting: the storage of the password (can someone sneak into your machine and steal the password?), the transmission of the password (for example, when logging in over the web), and the replay of the password (for example, when the password entered by a user during login is verified). Hash functions come in handy in all these cases. You can find out how this is done by reading any good book on Linux/Unix internals. Because only hash digests of the passwords are stored, the passwords cannot simply be read off the password file, even if that file is compromised. The only disadvantage is that users cannot retrieve lost passwords; they must reset them.
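The idea of storing only a digest can be illustrated as follows. This is not the scheme Linux itself uses. It is a generic sketch built on PBKDF2 with a random salt, from Python's standard library, and the iteration count and function names are arbitrary assumptions.

```python
import hashlib
import hmac
import os

ITERATIONS = 100_000  # illustrative work factor, chosen arbitrarily

def hash_password(password, salt=None):
    """Return (salt, digest). Only these two values are stored, never the password."""
    if salt is None:
        salt = os.urandom(16)  # a fresh random salt for each password
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return salt, digest

def verify_password(password, salt, stored_digest):
    """Re-hash the password entered at login and compare it with the stored digest."""
    _, candidate = hash_password(password, salt)
    return hmac.compare_digest(candidate, stored_digest)

salt, stored = hash_password("open sesame")
print(verify_password("open sesame", salt, stored))  # True
print(verify_password("open sesam", salt, stored))   # False
```

Since the stored digest cannot be turned back into the password, a forgotten password can only be reset, not looked up.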

Some important properties of good checksums are:

1. Two different sets of data should, for all practical purposes, never give the same checksum (two different persons will have different fingerprints).
2. It is impossible (or extremely difficult) to reconstruct the original data set from the checksum of that data set (can you get a person's photograph just from his fingerprint? If you could, you would help our Police enormously).
3. A given data set will always lead to the same checksum (the same person cannot have two fingerprints for the same finger).
4. Just like taking the fingerprint of a person, computing the checksum for any arbitrary data set should be feasible and relatively easy (read: efficient).
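Some of these properties can be observed directly. The sketch below uses SHA-256 from Python's `hashlib` as an assumed example of a good hash function. The helper names and sample strings are invented for the demonstration.

```python
import hashlib

def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def differing_bits(hex1: str, hex2: str) -> int:
    """Count the bit positions in which two equal-length hex digests differ."""
    return bin(int(hex1, 16) ^ int(hex2, 16)).count("1")

first = sha256_hex(b"checksums are your best friends")
again = sha256_hex(b"checksums are your best friends")
other = sha256_hex(b"checksums are your best friendz")

print(first == again)   # True: the same data always gives the same digest
print(first != other)   # True: one changed letter gives a different digest
print(len(first) * 4)   # 256: the digest length in bits, whatever the input

# For a good hash function, a one-letter change typically flips
# roughly half of the output bits.
print(differing_bits(first, other))
```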

Two of the most popular checksum algorithms (also known as hash digest algorithms or hash functions) are sha and md5. Of course, other hash functions are also available. Each has its own strengths and weaknesses. For a good briefing on hash functions, visit “the hash function lounge”, at: http://paginas.terra.com.br/informatica/paulobarreto/hflounge.html Another interesting site about hash digests is: http://www.hashemall.com/ Hashing is closely related to cryptography, and is the basis of a technique called “digital signature”. A digital signature itself is an encrypted form of a hash digest.

sha

According to Wikipedia: “SHA stands for Secure Hash Algorithm. Hash algorithms compute a fixed-length digital representation (known as a message digest) of an input data sequence (the message) of any length. The term SHA collectively denotes five cryptographic hash functions designed by the National Security Agency (NSA) and published by the NIST as a U.S. Federal Information Processing Standard.”

The original specification of the algorithm was published in 1993 as the Secure Hash Standard, FIPS PUB 180, by the US government standards agency NIST (National Institute of Standards and Technology). This version is now often referred to as "SHA-0". SHA-0 was withdrawn by the NSA shortly after publication and was superseded by a revised version, published in 1995 in FIPS PUB 180-1 and commonly referred to as "SHA-1". The newer hash functions SHA-224, SHA-256, SHA-384, and SHA-512 are collectively called the SHA-2 family; the first of them were published by NIST in 2001.

The Secure Hash Standard (SHS) (FIPS PUB 180-2) specifies four secure hash algorithms, SHA-1, SHA-256, SHA-384, and SHA-512, for computing a condensed representation of electronic data (the message). When a message of any length < 2^64 bits (for SHA-1 and SHA-256) or < 2^128 bits (for SHA-384 and SHA-512) is input to an algorithm, the result is an output called a message digest. Message digests range in length from 160 to 512 bits, depending on the algorithm.
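The digest lengths quoted above can be checked with Python's standard `hashlib` module. This small sketch assumes a Python build that provides all five algorithms, which is normally the case. The sample messages are arbitrary.

```python
import hashlib

SHA_FUNCTIONS = ("sha1", "sha224", "sha256", "sha384", "sha512")

# Digest length in bits for each function.
digest_bits = {name: hashlib.new(name).digest_size * 8 for name in SHA_FUNCTIONS}
print(digest_bits)
# {'sha1': 160, 'sha224': 224, 'sha256': 256, 'sha384': 384, 'sha512': 512}

# The length of the message has no influence on the length of the digest.
short = hashlib.sha256(b"a").digest()
long_ = hashlib.sha256(b"a" * 1_000_000).digest()
print(len(short) == len(long_))  # True
```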

Notice that in all the above, for a given SHA function, the length of the digest is fixed, whatever the length of the message. This makes it easy to strip out the hash digest from the received “padded message” (message plus digest).

md5

MD5 (Message Digest 5) was designed in 1991 by Ronald Rivest (the “R” in the famous technique called RSA cryptography). The MD5 homepage at http://userpages.umbc.edu/~mabzug1/cs/md5/md5.html gives details of this tool. MD5 replaces an earlier hash function, MD4 (also created by Ronald Rivest), and is more secure than MD4. However, a number of weaknesses of MD5 have been found in recent years. A recent paper published in this area claims that a collision of MD5 can be found within one minute on a standard PC, using a method called tunneling. Despite its weaknesses, MD5 is still widely used in digital signature processes, and it has been implemented in many programming languages.

The MD5 algorithm is described in RFC 1321 (see http://www.ietf.org/rfc/rfc1321.txt). The algorithm takes a message of arbitrary length and produces a 128-bit message digest. MD5 is a very popular scheme, and is used in SSL, PGP, HTTP authentication, Tripwire, and many other places. An MD5 hash is commonly used to verify the integrity of files (i.e., to verify that a file has not changed as a result of file transfer, disk error, meddling, etc.).
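As a quick check of the 128-bit digest, the sketch below computes MD5 digests with Python's `hashlib` and compares them with three entries of the test suite listed in RFC 1321. The dictionary name is invented for the example, and it assumes a Python build in which MD5 is enabled.

```python
import hashlib

# Message -> expected digest, taken from the test suite in RFC 1321.
RFC1321_VECTORS = {
    b"": "d41d8cd98f00b204e9800998ecf8427e",
    b"a": "0cc175b9c0f1b6a831c399e269772661",
    b"abc": "900150983cd24fb0d6963f7d28e17f72",
}

for message, expected in RFC1321_VECTORS.items():
    md5 = hashlib.md5(message)
    assert md5.hexdigest() == expected   # matches the published value
    assert md5.digest_size * 8 == 128    # always 128 bits, whatever the input
    print(message, md5.hexdigest())
```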

In general, sha is considered to be more secure than md5, but it is also slower. All this is still a matter of research and work-arounds, so the comparison can be misleading.

Sometimes, these tools are referred to as encryption tools. This is technically not correct. Encryption always involves scrambling the original text, to make it illegible to an unauthorised agent, in such a way that the intended recipient can unscramble it again. Hash digests do not scramble the original text/message per se: the digest is a short value computed from the message, and the message cannot be recovered from it. It is a different matter that you may decide to add another layer of security. You may choose to do some encryption, in addition to computing a checksum/hash digest.

Linux and hash functions

In addition to using hash digests (sha, md5) for password handling, Linux gives you commands to help you create or verify hash digests. These commands are installed by default in most Unix, Linux, and Unix-like operating systems. In fact, you would have noticed that many Linux distribution CDs give you md5 checksums for all the files on the CD. Here is an extract of the MD5 sums of all files in the Ubuntu distro DVD which was supplied with the June 2008 issue of LFY:

7953641146f3103b8f8a77dc85a15060  ./casper/filesystem.manifest
62d20d5168ce34daee90cb5ece46659c  ./casper/filesystem.manifest-desktop
7f659791714bf00a645972ed46ccb710  ./casper/initrd.gz
3f5a29371c9a3032a968895ce5c8ad55  ./casper/filesystem.squashfs
1bf6dca81a4496dd2c29d517b30f087a  ./casper/vmlinuz
c09db48c645f089953612515ecf6d920  ./dists/hardy/Release
dec458a3f731075379024c2e64839471  ./dists/hardy/restricted/debian-installer/binary-i386/Packages.gz
2b73168631a2138bae2aa21df400b9a6  ./dists/hardy/restricted/debian-installer/binary-i386/Packages
583ab564952d4ce47b887c8b5b0ea30f  ./dists/hardy/restricted/binary-i386/Packages.gz
285c43d848f34ba43ce2e23414e22cf9  ./dists/hardy/restricted/binary-i386/Release
b0861b853731d9b87e952a4e6ea885bb  ./dists/hardy/restricted/binary-i386/Packages
eef8026473fd7fcd3e23cd0cf7ea2ca9  ./dists/hardy/Release.gpg
1422cfaedc2f8fd49d92489bbf240c08  ./dists/hardy/main/debian-installer/binary-i386/Packages.gz
9cfc07519ad901db6fdc6cdf508e4120  ./dists/hardy/main/debian-installer/binary-i386/Packages
b3fcfd7f4a904026bdd1d2778553dbe4  ./dists/hardy/main/binary-i386/Packages.gz
b5c09269e28533a41ce04df0fd4cff0c  ./dists/hardy/main/binary-i386/Release
6948121bf88720189ed27fa8fc115b18  ./dists/hardy/main/binary-i386/Packages

In each of the above lines, the first field is the md5 digest of a file, and the second field is the name/path of that file.
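Splitting such a line into its two fields is straightforward. The helper below is a hypothetical sketch: the function name is invented, and it assumes file names without leading blanks. It also tolerates the `*` that md5sum places before the name of a file read in binary mode.

```python
import string

def parse_md5sum_line(line):
    """Split one line of a checksum list into (digest, path)."""
    digest, path = line.strip().split(None, 1)
    if len(digest) != 32 or not all(ch in string.hexdigits for ch in digest):
        raise ValueError(f"not an MD5 digest: {digest!r}")
    if path.startswith("*"):   # binary-mode marker, not part of the name
        path = path[1:]
    return digest.lower(), path

sample = "7f659791714bf00a645972ed46ccb710  ./casper/initrd.gz"
print(parse_md5sum_line(sample))
# ('7f659791714bf00a645972ed46ccb710', './casper/initrd.gz')
```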

The Linux command “md5sum” can be used to generate MD5 message digests. It can also be used to check the MD5 digests of files. Linux gives similar commands for sha (sha1sum, sha224sum, sha256sum, sha384sum, sha512sum). With a little practice and experimentation, you will easily be able to perform many tricks with these tools. You can put md5sum or the sha commands into shell scripts for more elaborate uses. By combining these with GPG (an encryption/decryption package), you can create digital signatures and really protect all your data. In fact, since GPG comes bundled for free with Linux distros, it is easy to manage message digests and digital signatures on your own.

With the GPG program (http://www.gnupg.org/docs.html) you can digitally "sign" files, and detect tampering easily. But the catch is that with GPG you can "sign" only one file at a time. With MD5 you can create checksums of several files at a time (using a small one-line script). But here the catch is that if an intruder can tamper with a file, he can also tamper with its md5 checksum, and destroy all traces of his mischief.

So, we use both these tools in tandem. First, we compute the MD5 checksums of all the files in the directory "/pypath/myfiles" and store these checksums in a file, say “allfiles.md5”. We then digitally "sign" this md5 file using GPG and a GPG key. The signing key is a secret, and is known only to the person who created it. Any modification to any of the files is detected by verifying the md5 checksums. To verify that these md5 checksums are okay, use the command:

md5sum -c allfiles.md5

Any modification to the md5 checksum file itself gets detected by verifying, using GPG, the digital signature of the checksum file.
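The first half of this procedure, building the checksum file, can be sketched in Python as below. The function names are invented for the illustration, and the output merely imitates the "digest, two blanks, path" layout shown earlier. The signing step is left to GPG.

```python
import hashlib
from pathlib import Path

def md5_of_file(path, chunk_size=65536):
    """Compute the MD5 digest of one file, reading it in chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def write_manifest(directory, manifest_name="allfiles.md5"):
    """Write one 'digest  ./relative/path' line per file found under directory."""
    directory = Path(directory)
    lines = []
    for path in sorted(directory.rglob("*")):
        # Skip directories, and skip the checksum file itself.
        if path.is_file() and path.name != manifest_name:
            rel = path.relative_to(directory).as_posix()
            lines.append(f"{md5_of_file(path)}  ./{rel}")
    manifest = directory / manifest_name
    manifest.write_text("\n".join(lines) + "\n")
    return manifest
```

The resulting file can then be signed and checked with GPG, for instance with `gpg --detach-sign allfiles.md5` followed later by `gpg --verify allfiles.md5.sig allfiles.md5`. These invocations assume a GPG key pair has already been created.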

Here is an example of an md5 and a sha1 digest of an earlier version of this article (checksums.odt):

919932f8244cc06a0c7135b3f9dc2301  checksums.odt
5a58e39daa45e645dee4c515b11251254844d46d  checksums.odt

These checksums were created using the md5sum and sha1sum commands of Linux. The md5 digest was stored in a file called sum. We then modified the source file (checksums.odt) slightly, and issued the md5 checking command:

md5sum -c sum

This is what we got:

checksums.odt: FAILED
md5sum: WARNING: 1 of 1 computed checksum did NOT match
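The same experiment can be imitated in Python. This is a rough stand-in for `md5sum -c`, not a reimplementation of it: the function name `check_manifest` is invented, the file contents are dummies, and paths are resolved relative to the checksum file rather than the current directory.

```python
import hashlib
import tempfile
from pathlib import Path

def check_manifest(manifest_path):
    """Return {file name: 'OK' or 'FAILED'} for each line of a checksum file."""
    manifest_path = Path(manifest_path)
    results = {}
    for line in manifest_path.read_text().splitlines():
        if not line.strip():
            continue
        expected, name = line.split(None, 1)
        actual = hashlib.md5((manifest_path.parent / name).read_bytes()).hexdigest()
        results[name] = "OK" if actual == expected else "FAILED"
    return results

with tempfile.TemporaryDirectory() as tmp:
    doc = Path(tmp) / "checksums.odt"
    doc.write_bytes(b"an earlier version of this article")

    # Store the md5 digest in a file called sum, in md5sum's layout.
    manifest = Path(tmp) / "sum"
    manifest.write_text(f"{hashlib.md5(doc.read_bytes()).hexdigest()}  {doc.name}\n")
    before = check_manifest(manifest)

    # Modify the source file slightly and check again.
    doc.write_bytes(b"an earlier version of this article, modified")
    after = check_manifest(manifest)

print(before)  # {'checksums.odt': 'OK'}
print(after)   # {'checksums.odt': 'FAILED'}
```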

You can try out the above experiment yourself, and get an idea of how md5sum works. It is easy. It is also possible to create and check the hash digests of several files, using a small script.

Thanks to all these features, you feel very secure when working with Linux.

Closing remarks

This article was a quick overview of checksums (hash digests). The subject runs deep mathematically. We have tried to avoid all the maths, and have made some simplifications in the presentation. Checksums form an ideal starting point for ensuring security. You can create an elaborate security arrangement by cleverly combining checksums with digital signatures or encryption (or both). A good book on cryptography will give you the details which we have (deliberately) not covered in this introductory article.

About the Author

Parthasarathy is an aggressive supporter of FOSS. He teaches discrete mathematics, and preaches LaTeX and Linux, to students of Computer Science. His website, http://algolog.tripod.com/nupartha.htm, will give more specific details about him. His contact address is: [email protected]
