MD5FRACT: partial md5 checksum proposal

Though BitTorrent is nowadays the preferred protocol for sharing and downloading big files like Linux distribution ISO images, the http:// and ftp:// protocols are still alive and widely used. Although these protocols already have their own integrity checks, different kinds of errors may occur at different levels of the protocol stack, resulting in corrupted data saved on your hard disk. This is where the MD5 and SHA-1 algorithms come to help us: reasonably fast hash functions useful for checking the integrity of the files we download.

What happens if we detect that a file is corrupted? We have to download it again. The whole file, regardless of its size. While this is not a problem for most high-speed connection users, we shouldn't forget that many people still can't access the internet at high speed (or with a "flat" rate). In any case, for everyone (service providers, high- and low-speed connection users) such a situation leads to a waste of time, money and bandwidth.

What could we do to reduce the negative impact of these common situations?

A cheap and fast solution could be to use partial checksums, as most peer-to-peer protocols already do. If there's only one wrong byte in my file, why should I download the entire file again? Both the http:// and ftp:// protocols support resuming downloads with random access[1], so we could download again just the corrupted part.
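
To make the idea concrete, here's a hedged sketch of how a client could re-fetch only one corrupted chunk over HTTP. The 20 MiB chunk size matches the example output later in this post, but the variable names and the mirror URL are illustrative, not part of the script:

```shell
# Hypothetical sketch: compute the byte range of chunk 8, assuming
# 20 MiB chunks (variable names and the example URL are made up).
CHUNK_SIZE=$((20 * 1024 * 1024))        # 20 MiB per chunk
CHUNK=8
START=$(( (CHUNK - 1) * CHUNK_SIZE ))   # first byte of the chunk
END=$(( CHUNK * CHUNK_SIZE - 1 ))       # last byte (HTTP ranges are inclusive)
echo "bytes=$START-$END"
# A client could then re-download just that range, e.g.:
#   curl -r "$START-$END" -o chunk8.bin http://mirror.example/Gatc.avi
```

The same arithmetic works for FTP resuming via the REST command, as long as the client can seek that far (see the footnote).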

To achieve this goal, my proposal is a replacement for the common md5sum *nix command that allows partial checksumming.
I wrote a possible replacement (a bash script) which uses dd and md5sum to checksum file chunks instead of the entire given file. The aim of the script is to be completely compatible with md5sum: the replacement should be almost transparent to system administrators, and should recognize "standard" .md5 files and pass them directly to the standard md5sum utility. The new checksum files get an .md5f extension, because partial checksum information can't be compatible with normal .md5 files.

The script version is 0.5, and it's just a demonstration script - not the final version. It can checksum partial chunks of a given file, and check the correctness of a file against a .md5f checksum list. It's open source: you can take it, modify it and redistribute it; just credit me and/or this blog, please. See below for the download link.

How it works is really simple: dd reads a chunk of the given file and pipes it to md5sum; the hash is written to stdout. Here's an example of how it works:

eu@aer:/d/progetti/md5fract$ ./ Gatc.avi
e54b1acf481f307ae22ac32bbc6ce5df 1:Gatc.avi
a04871e4362c38c1243b2dd165bcfa07 2:Gatc.avi
9f1721ff9bac5facb986cc0964e24a60 3:Gatc.avi
a8230976234a97a4e11465eb1bb850d6 4:Gatc.avi
e8f3ef6ffa56292dfae0dc50f06712b5 5:Gatc.avi
368f6d1ae0106c49b71e2d9c0ab05e96 6:Gatc.avi
66e3d9107c0cf40dbafa09bb6cac38a3 7:Gatc.avi
7f8fd01e16b6900661f8d4aac2cee7f5 8:Gatc.avi
1eb25dc5d3e239ba247c94f1614588fd 9:Gatc.avi
5f96e9f4e7fdf92172a8f36d265a5070 10:Gatc.avi
eu@aer:/d/progetti/md5fract$ ./ Gatc.avi &> Gatc.avi.md5f
eu@aer:/d/progetti/md5fract$ ./ --check Gatc.avi.md5f
Gatc.avi: OK (10)
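
The core loop described above can be sketched like this. This is a minimal sketch, not the actual script: the 20 MiB chunk size matches the example output, but the function and variable names are mine:

```shell
#!/bin/bash
# Minimal sketch of the core loop (assumed 20 MiB chunks; names are
# illustrative, not taken from the actual script).
md5fract_sketch() {
    local file="$1"
    local chunk_size=$((20 * 1024 * 1024))
    local size chunks i hash
    size=$(wc -c < "$file")
    chunks=$(( (size + chunk_size - 1) / chunk_size ))   # round up
    for (( i = 0; i < chunks; i++ )); do
        # dd skips i chunks, reads exactly one, and pipes it to md5sum
        hash=$(dd if="$file" bs="$chunk_size" skip="$i" count=1 2>/dev/null \
               | md5sum | awk '{print $1}')
        echo "$hash $((i + 1)):$file"
    done
}
```

Each output line then follows the `hash N:filename` format shown in the transcript above.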

If we modify one of the hashes, we get this error message:

eu@aer:/d/progetti/md5fract$ ./ --check Gatc.avi.md5f
Gatc.avi: FAILED
Wrong hash at line 8 for file Gatc.avi:
Chunk: 8 (146800640 ... 167772160)
Calculated hash: 7f8fd01e16b6900661f8d4aac2cee7f5
Found hash: 9f8fd01e16b6900661f8d4aac2cee7f5
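
Once the bad chunk has been re-downloaded and written back, only that chunk needs to be re-hashed, not the whole file. A hypothetical helper for that step could look like this (again assuming 20 MiB chunks; the function name is mine, not the script's):

```shell
#!/bin/bash
# Hypothetical helper: hash a single chunk of a file and compare it with
# the expected hash from the .md5f list (20 MiB chunks assumed).
verify_chunk() {
    local file="$1" chunk="$2" expected="$3"
    local chunk_size=$((20 * 1024 * 1024))
    local actual
    actual=$(dd if="$file" bs="$chunk_size" skip=$((chunk - 1)) count=1 2>/dev/null \
             | md5sum | awk '{print $1}')
    if [ "$actual" = "$expected" ]; then
        echo "chunk $chunk: OK"
    else
        echo "chunk $chunk: FAILED"
    fi
}
```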

At first I thought that an executable implementing its own MD5 routine would perform better, because using dd for each chunk means opening a file handle, seeking inside the file and starting a new md5sum process for every chunk; however, a quick performance comparison shows that, thanks to the fast random file access of modern filesystems, this bash script with its redundant file openings is almost as fast as the traditional md5sum program run on the same file:

eu@aer:/d/progetti/md5fract$ time ./ DevAdvocate.avi
49980c46641915c55252772dc4933090 1:DevAdvocate.avi
7bc34fa302d6fee588eb06421fd529c0 2:DevAdvocate.avi
7c521d571165f4224693396f378e5001 3:DevAdvocate.avi
34975d9be25a75d7e39cc882db9e0ca4 4:DevAdvocate.avi
76e5e56e8972656ef17f1b2c429b3695 5:DevAdvocate.avi
cc32be51ba083482bb4b3e9d2143eb00 6:DevAdvocate.avi
d5ce8ae2027288dfe182276ce2143a69 7:DevAdvocate.avi
5037475202e4ae2682434c612e11b6a8 8:DevAdvocate.avi
0dc569188a14c74c9e6807a05d9af1f6 9:DevAdvocate.avi
a9ecf0210a55e82349939c11d21272b4 10:DevAdvocate.avi

real 1m5.386s
user 0m5.256s
sys 0m6.536s
eu@aer:/d/progetti/md5fract$ time md5sum DevAdvocate.avi
c61c60414ba0042169d0caf0880c2610 DevAdvocate.avi

real 0m56.813s
user 0m5.260s
sys 0m1.912s

The user time is basically identical, but the system time (needed, I suppose, to open file handles and spawn subprocesses repeatedly) more than doubles. Anyway, on my machine md5fract always takes around 110%-115% of the time md5sum needs to checksum the same file (the file "DevAdvocate.avi" is about 1.4 GB). In conclusion, checksumming partial file chunks with a bash script doesn't perform badly. This test was done on an ext3 filesystem, but it would be useful to run more tests on different filesystems (which may have different seek speeds).

The aim for version 1.0 would be to have the same script with:
  • Support for multiple files (currently it accepts just one input file);
  • Total compatibility with the existing md5sum utility;
  • Compatibility with non-bash shells;
  • Better error handling.
If you're tired of downloading the same files again and again, or if you simply like the concept, please spread this idea. I'll complete it sooner or later (sooner if I see that some admins are interested), but of course anyone else could complete it ;). After completion, I'd like to propose it to several mirror services, and I hope someone will adopt it.

In a nutshell:
  • MD5 partial checksum utility (md5fract) for bash, version 0.5
  • Requires md5sum already installed
  • Download link (less than 6K)

[1]: Unfortunately, some FTP clients use just a signed 32-bit integer to seek within remote files; as a consequence, they can't seek past the 2 GiB mark.

1 comment:

Anonymous said...

Not bad, you can use par2 for that.