The Unbearable Lightness of Bit: June 2007

MD5FRACT: partial md5 checksum proposal

Though BitTorrent is nowadays a preferred protocol to share and download big files like Linux distribution iso images, http:// and ftp:// protocols are definitely still alive and well used. Although they already have their integrity check algorithms, different kind of errors may occur at different levels of the protocol stack, resulting in corrupted data saved on your hard disk. Here come to help us MD5 and SHA1 algorithms: reasonably fast hash functions useful to check the integrity of the files we downloaded.

What happens if we detect a file is corrupted? We have to download it again. The whole file, despite its size. While this is not a problem for most high-speed connection users, we shouldn't forget that many people can't access yet the internet in a very fast way (or with a "flat" rate). Any way, for everyone (service providers, high and low speed connection users) a such situation leads to a waste of time, money and bandwidth.

What could we do to reduce the negative impact of these common situations?

A cheap and fast solution could be to use partial checksums, as most peer-to-peer protocols already do. If there's only one wrong byte in my file, why should I download the entire file again? Both http:// and ftp:// protocols support file download resuming with random access¹, so that we can download again just the corrupted part.

To achieve this goal, my proposal is to use a replacement of the common md5sum *nix command allowing partial checksumming.
I wrote a possible replacement (a bash script) which uses dd and md5sum to checksum just file chunks, instead of the entire given file. The aim of the script is to be completely compatible with md5sum: the replacement should be almost transparent for system administrators, and should recognize "standard" .md5 files and pass them directly to the standard md5sum utility. New .md5 files would have .md5f extension, because informations on partial checksum can't be compatible with normal .md5 files.

The script version is 0.5, and it's just a demonstration script - not the final version. It's capable of checksumming partial chunks of the given file, and checking the correctness of a file given a .md5f checksum list. It's open source, you can take it and modify it and redistribute it; just cite me and/or this blog, please. See below for download link.

Its functioning is really simple: dd reads a chunk of the given files and passes it on a pipe to md5sum; the hash is written to stdout. Here's an example about how it works:

eu@aer:/d/progetti/md5fract$ ./md5fract.sh Gatc.avi
e54b1acf481f307ae22ac32bbc6ce5df 1:Gatc.avi
a04871e4362c38c1243b2dd165bcfa07 2:Gatc.avi
9f1721ff9bac5facb986cc0964e24a60 3:Gatc.avi
a8230976234a97a4e11465eb1bb850d6 4:Gatc.avi
e8f3ef6ffa56292dfae0dc50f06712b5 5:Gatc.avi
368f6d1ae0106c49b71e2d9c0ab05e96 6:Gatc.avi
66e3d9107c0cf40dbafa09bb6cac38a3 7:Gatc.avi
7f8fd01e16b6900661f8d4aac2cee7f5 8:Gatc.avi
1eb25dc5d3e239ba247c94f1614588fd 9:Gatc.avi
5f96e9f4e7fdf92172a8f36d265a5070 10:Gatc.avi
eu@aer:/d/progetti/md5fract$ ./md5fract.sh Gatc.avi &> Gatc.avi.md5f
eu@aer:/d/progetti/md5fract$ ./md5fract.sh --check Gatc.avi.md5f
Gatc.avi: OK (10)
eu@aer:/d/progetti/md5fract$

If we modify one of the hashes, we get this error message:

eu@aer:/d/progetti/md5fract$ ./md5fract.sh --check Gatc.avi.md5f
Gatc.avi: FAILED
Wrong hash at line 8 for file Gatc.avi:
Chunk:           8 (146800640 ... 167772160)
Calculated hash: 7f8fd01e16b6900661f8d4aac2cee7f5
Found hash:      9f8fd01e16b6900661f8d4aac2cee7f5
eu@aer:/d/progetti/md5fract$

I was first thinking that an executable file implementing its own md5sum routine would have had a better performance, because using dd for each chunk implies opening a file handle, seeking inside the file and starting a new md5sum process for each chunk; however, a quick performance comparison shows that, thanks quick random file access of modern filesystems, this bash script with redundant file opening is almost as fast as the traditional md5sum program launched with the same file:

eu@aer:/d/progetti/md5fract$ time ./md5fract.sh DevAdvocate.avi
49980c46641915c55252772dc4933090 1:DevAdvocate.avi
7bc34fa302d6fee588eb06421fd529c0 2:DevAdvocate.avi
7c521d571165f4224693396f378e5001 3:DevAdvocate.avi
34975d9be25a75d7e39cc882db9e0ca4 4:DevAdvocate.avi
76e5e56e8972656ef17f1b2c429b3695 5:DevAdvocate.avi
cc32be51ba083482bb4b3e9d2143eb00 6:DevAdvocate.avi
d5ce8ae2027288dfe182276ce2143a69 7:DevAdvocate.avi
5037475202e4ae2682434c612e11b6a8 8:DevAdvocate.avi
0dc569188a14c74c9e6807a05d9af1f6 9:DevAdvocate.avi
a9ecf0210a55e82349939c11d21272b4 10:DevAdvocate.avi

real    1m5.386s
user    0m5.256s
sys     0m6.536s
eu@aer:/d/progetti/md5fract$ time md5sum DevAdvocate.avi
c61c60414ba0042169d0caf0880c2610  DevAdvocate.avi

real    0m56.813s
user    0m5.260s
sys     0m1.912s
eu@aer:/d/progetti/md5fract$

The user time is basically identical, but the system time (necessary to open file handles and new subprocesses more than once, I suppose) is more than doubled. Anyway, on my machine the time required for md5fract execution is always around 110%-115% of the time required by md5sum to checksum the same file (file "DevAdvocate.avi" is about 1.4 Gb). In conclusion, checksumming partial file chunks through a bash script seems to have not a bad performance. This test was done on ext3 filesystem, but would be useful to do more tests on different filesystems (that may have different file seeking speed).

The aim for version 1.0 would be to have the same script with:

Multiple files support (now it supports just one input file);
Total compatibilty with existent md5sum utility;
Non-bash shells compatibility;
Better error handling.

If you're tired of downloading the same files again and again, or if you simply like the concept, please spread this idea. I'll complete it sooner or later (sooner if I see some admins are interested in it), but of course anyone else could complete it ;). After completion, I'd like to propose it to several mirror services, and I hope someone will adopt it.

In a nutshell:

MD5 partial checksum utility (md5fract) for bash, version 0.5
Requires md5sum already installed
Download link (less than 6K)

¹: Unfortunately, some FTP clients use just a signed 32 bit integer to seek remote files; as a consequence, they can't seek addresses higher than 2 Gb.

A CSS style to post code in your blog

So, you own a blog and you'd like to post some code. Is it HTML, CSS, PHP, Java, ...? You need to highlight and format it, somehow. For example, here is the code to enumerate an associative array in PHP:

<?php
 echo " Array $arr:<br>\n";
 foreach ($arr as $key => $value) {
  echo " - Key: $key -&gt; $value<br>\n";
 }
?>

First of all you need a parser to highlight the code. Most of good editors (e.g. Kate, SciTE) have an "export as HTML" feature, and that's what you need. After exporting the code to HTML, you need to format the paragraph some how in a way you like (tipically with a monospaced font, and often with a special background color). Another good thing would be a horizontal scrollbar appearing only when needed, without the normal word-wrap which is a bit invasive for code quotes.
If you like the CSS style of the above paragraph, I can avoid you the effort to look for it in the source of this page; here's the exact same CSS I use in this blog (I just added some :hover border change):

 /* CSS for code quotes by http://binaryunit.blogspot.com
by Eugenio Rustico */
.code{
 width: 60%;
 background: #EEE;
 border-width: 1px;
 border-color: #CCC;
 border-style: solid;
 border-left-width: 4px;
 border-left-color: #000066;

 font-family: Courier, monospace;
 line-height: 115%;
 clip : auto;
 overflow : auto;

 padding: 1em;
 padding-left: 1.5em;
 margin: 1em;
 margin-left: 2em; 
 margin-right: 2em;

 /* white-space:nowrap; if you want DIV instead of PRE */ 
}

This class should be used in a <PRE> or in a <DIV> element, depending on your needs. If you don't have to format something else inside the same block with HTML code, just use a <PRE> tag; but remember that every space or tab inside the text will be "as is" in the output (the HTML page). License: just cite this blog in the comments or somewhere else, and you can use and abuse of it =)

Little note: due to Blogspot's buggy post editor, I had to add a non-breakable space ( ) at the beginning of the margin-right row. Unfortunately, every space in the end of a line in the code editor, even if the line is broken for word-wrap, is considered useless and is trimmed out... Without this addiction, the margin-right row is not aligned with the other rows.
Moreover, every time you write the " " string in the HTML editor, and you edit the post again with the wysiwyg editor, it magically disappears and you have to edit the HTML code again... Buggy, buggy Blogspot!

The icing on the cake: If your code is very long (e.g. over 50 lines) but you don't want a such high box in your blog, just add the CSS max-height property, preferably to a value expressed in em (e.g. to about 30em) and a vertical scroll bar will appear if needed.

Forcing video preview frame in YouTube

Many people noticed YouTube flash player is not so rich in options, and one of its the lacks is that it's not possible to choose the preview frame for an uploaded video; not a static image, neither a frame from the video itself. However, if you absolutely have the need to choose the preview frame, you should know YouTube takes as preview the frame at about the temporal half of the video. This is not an official information, but just the result of some observation: maybe for some videos the choosing "algorithm" is different.

In my expreriences (the most beautyful video I uploaded, that's nevertheless a poor one, is this), the frame at the exact half of total frames number is not the chosen preview; it seems that the algorithm is:

Take the temporal length of the video;
Divide by 2, and truncate to the shortest integer;
Choose previous frame, or the current one if it's exactly the moment to change frame (if 1/framerate divides the length).

I made two videos to confirm this hypothesis, with labels in each frame indicating the current position:

As you can see, the choosen frame is not the exact half. 500 frames at 25 frames/second are 20 seconds, and 510 frames are 20.4 seconds. In both cases, frame number 251 (1-based) is choosen; 0.4 seconds more make no influence at all.
If you want to do more tests by yourself, you can use ImageMagick go generate the frames and MEncoder to merge them into a video. Here is the bash script I used to generate these videos.

Interlacing makes things more complicated, because there's no way to know in advance how YouTube converting engine will work; and, of course, we can not choose in advance useful video parameters (like deinterlacing method). So, try to upload an already deinterlaced video, if possible; Flash streams are progressive, and would be a good choice not to let YouTube decide how to deinterlace.

Update: by downloading again my same video, I noticed that its length has changed, though the original submission was already progressive. The conclusion is: I can't find out the algorithm YouTube uses to choose the preview, it would need more experiments and I don't have the time to do them.

But is this just a useless toy?
In my opinion, it's not just a toy, but a powerful trick. Consider, for example, this video:

The preview is absolutely arbitrary, and has *nothing* to do with the content of the video. During the first two hours after I uploaded it, 239 views were hit. Without this "special preview", the same video was viewed 2 times in a week... Tricks for cheating people apart, this technique can be used to set a decent preview to your videos; the default YouTube's algorithms produces lots of annoying previews, that just don't match well the real content of the videos. That was why I had the initial idea: in one of my videos, the chosen preview was a frame almost completely black, while the rest of the video was white.

As you probably noticed, poor seeking is another annoying fact in YouTube videos; this time, the injection of keyframes is related also to scene changes, so if you move the "preview frame" one or two or three position before, it's not the preview anymore but it's still a keyframe (I did this experiment). To do: upload a .flv video already encoded with many keyframes, to test if YouTube is going to delete some of them or not.

Now, if YouTube chose for your video an awful preview, you know how to force it to use a better one. Power to masses! But remember: from a great power, come great responsabilities... Try to imagine what spammers could do with this trick... Oouch!