
HTML page with pictures

Quick and simple: a short bash script to create an HTML page with all the pictures you have in the current directory.

echo -e "<html><head><title>Icons</title></head><body>\n" > file.html
ls -1 *.png | while read a; do echo -e "<img src='./$a'/>\n" >> file.html; done
echo -e "</body></html>\n" >> file.html

Just add different wildcards for every image format you want to include.
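
For example, a variant covering several image formats at once could look like this (a minimal sketch writing to the same file.html; adjust the patterns to taste):

echo "<html><head><title>Icons</title></head><body>" > file.html
for a in *.png *.jpg *.gif; do
[ -e "$a" ] || continue        # skip patterns that match no files
echo "<img src='./$a'/>" >> file.html
done
echo "</body></html>" >> file.html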

Online vs. offline hashes

Someone asked me why the hash of any string produced by Hash'em'all! is different from the hash of the same string produced in bash by

echo "string" | md5sum

The reason is simple: the "echo" command automatically adds a newline at the end of the input string. The "-n" option tells the command not to add it. So

echo -n "string" | md5sum

will give the same result as Hash'em'all!. So simple... but someone was going crazy over this.
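
If you want to see the extra byte for yourself, a quick check (assuming xxd is installed) is to hex-dump both variants; printf is a portable alternative to echo -n:

echo "string" | xxd        # the last byte is 0a, the trailing newline
echo -n "string" | xxd     # no trailing 0a
printf '%s' "string" | md5sum    # gives the same hash as echo -n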

FS-independent bash backup

Quick and dirty: a little script to back up an entire hard drive from bash:

#!/bin/bash
# License: do what you want but cite my blog ;)
# http://binaryunit.blogspot.com
#
# *** superSalva 1.0 by Eugenio Rustico ***
# Backup utility for ALL types of partitions
#
# FEATURES
# - Easy disk/partition image *even of unknown filesystems*
# - On the fly compression: no need for temporary files
# - Customizable process
#
# LIMITS
# - Could be faster
# - No fs-specific support
# - Reads and compresses even all-zero regions
#
# TODO:
# - Support for decompressor without zcat equivalent
# - Support for creating (better if bootable) iso images
# - Support for md5sum integrity verification (!)
# - Free space checking
# - Wizard
# - Final statistics and estimated time
# - Trap for CTRL+C

# Device to back up, even if NTFS or unknown. May be a partition or a whole disk
export DEVICE=/dev/hda6

# AUTOMATIC: device capacity, in kb (the device must be mounted for df to see it)
export DEVICE_DIM=`df -k | grep $DEVICE | awk '{print $2}'`

# Currently unused
export DEVICE_FREE=`df -k | grep $DEVICE | awk '{print $4}'`

# Destination/source directory. If destination, should have $(dim of $DEVICE) space free
# Do NOT put the destination on the same drive you're backing up!
# export DIR=/d/ripristino/e
export DIR=/pozzo

# Destination/source base filename. During backup files are overwritten.
export FILENAME=Immagine_E_20_6_2006

# On the fly compression commands. Should support reading from stdin and writing to stdout
# DEFAULT: gzip, well-known
# EXPERIMENTAL: 7zip, slower but better compression. YOU MUST HAVE 7zip ALREADY INSTALLED. But does "7cat" exist?
export COMPRESSOR=gzip
export DECOMPRESSOR=zcat

# Compression parameters
# Currently just a compression level passed to gzip
export COMPRESSOR_PARAMS=-9

# Compressed file extension. Optional but useful
export EXTENSION=gz

# Number of pieces to skip while backing up/restoring
# Useful for testing and for resuming interrupted backups
# CHANGE THIS ONLY IF YOU KNOW WHAT YOU ARE DOING
# Default: 0
export SKIP=0

# Block size. NOTE: the default piece dimensions depend on it!
# CHANGE THIS ONLY IF YOU KNOW WHAT YOU ARE DOING
export BLOCK_SIZE=1024

# Dimension of pieces to compress and backup, in kb.
# Too small = too many pieces, not useful
# Too large = inefficient compression
# MAX = 4194303 (if dest fs is not FAT, may be higher), few pieces
# MEDIUM VALUES:
# 524288 (512 Mb)
# 262144 (256Mb)
# 131072 (128 Mb), many pieces
# 65536 (64 Mb)
# 32768 (32 Mb), definitely too many pieces!
# MIN = 1 (nonsense)
# DEFAULT = 1048576 (1 Gb, RECOMMENDED)
# NOTE: these values are block-size dependent (here we use 1024-byte blocks)
export PIECE_DIM=32768

# AUTOMATIC: number of pieces
# It should be one more, but the loop is zero-based, so it doesn't matter
export NUM=$(($DEVICE_DIM/$PIECE_DIM))

# Action: BACKUP or RESTORE
export ACTION=BACKUP

# Want to see what I'm doing?
export VERBOSE=1

echo
echo
echo " *** superSalva 1.0 *** "
echo
echo
echo "Device: $DEVICE ($DEVICE_DIM kb)"
echo Compressor: $COMPRESSOR
echo Decompressor: $DECOMPRESSOR
echo Parameters: $COMPRESSOR_PARAMS
echo Destination: $DIR/$FILENAME.NUM.TOT.$EXTENSION
echo Device: $DEVICE
echo Pieces: $(($NUM+1)) pieces, $PIECE_DIM kb each
echo Action: $ACTION
echo
echo "Ready? CTRL+C to abort, ENTER to start. It may take a LONG time."
read
echo
echo

export SUM=0
for i in `seq $SKIP $NUM`
do
export FILEN="$DIR/$FILENAME.$(($i+1)).$(($NUM+1)).$EXTENSION"
export SK=$(($i*$PIECE_DIM))
if [ "$ACTION" == "RESTORE" ]
then
echo "* Decompressing and writing piece $(($i+1)) of $(($NUM+1)) (kb $(($SK+1)) to $((($i+1)*$PIECE_DIM)))..."
export COMMAND="$DECOMPRESSOR $FILEN | dd of=$DEVICE seek=$SK count=$PIECE_DIM bs=$BLOCK_SIZE"
if [ $VERBOSE == 1 ]; then echo $COMMAND; fi
eval $COMMAND # eval so that the pipe inside $COMMAND is actually interpreted by the shell
echo "* Successfully restored $FILEN"
echo
else
echo "* Reading and compressing piece $(($i+1)) of $(($NUM+1)) (kb $(($SK+1)) to $((($i+1)*$PIECE_DIM)))..."
export COMMAND="dd if=$DEVICE skip=$SK count=$PIECE_DIM bs=$BLOCK_SIZE | $COMPRESSOR $COMPRESSOR_PARAMS > $FILEN"
if [ $VERBOSE == 1 ]; then echo $COMMAND; fi
eval $COMMAND # eval so that the pipe and redirection inside $COMMAND are actually interpreted by the shell
export LAST_DIM=$((`ls -l $FILEN | awk '{print $5}'`/1000))
echo "* Saved $FILEN ($LAST_DIM kb, ratio $(((100*$LAST_DIM)/$PIECE_DIM))%)"
export SUM=$(($SUM+$LAST_DIM))
echo
fi
done

echo
if [ "$ACTION" == "RESTORE" ]
then
echo "Finished. $DEVICE seems to be restored."
else
echo "Finished. $(($NUM+1)) files, tot $SUM kb ($((100*$SUM/$DEVICE_DIM))% of original $DEVICE size)"
fi
echo Bye!
echo

Thanks to dd. Features:
  • On the fly compression
  • No need for temporary files
  • File-system independent
  • Backup and restore facilities
  • Keeps master boot records
  • ...
TODO: lots of things (checksumming facilities, a configuration wizard, free space checking, command tests...); the complete TODO list is inside the script.
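
For reference, the core idea behind the script, without the chunking, boils down to a single dd/gzip pipe (a minimal sketch reusing the script's default device and destination paths):

dd if=/dev/hda6 bs=1024 | gzip -9 > /pozzo/whole_image.gz     # backup
zcat /pozzo/whole_image.gz | dd of=/dev/hda6 bs=1024          # restore

The pieces only add the ability to resume an interrupted run and to store the image on filesystems (like FAT) that can't hold a single huge file.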

Organize your pictures by EXIF tags with bash

Just a simple script to organize thousands of pictures by moving them into subdirectories based on their EXIF shot date:

#!/bin/bash
# BashButler 1.0 - 11/2007 by Eugenio Rustico
# Bash script to organize your pictures in folders by date
#
# GPL v3 license http://www.gnu.org/licenses/gpl-3.0.html

ls -1 *.jpg | while read fn
do
export dt=`exiv2 "$fn" | grep timestamp | awk '{ print $4 }' | tr ":" "-"`
if test ! -e "./$dt"
then
mkdir "$dt"
fi
mv "$fn" "./$dt"
echo "File $fn moved to $dt"
done

echo Good job!

I wrote it to sort out 4,900 pictures that were all in the same directory, and it was really fast and useful. My unstoppable sense of humor came up with this amazing script name: Bash Butler. I'm very sorry for that. =)

Notes:
  • Of course you need the exiv2 package on your system, but you can use other CLI EXIF tools by modifying just one line;
  • With very few changes, you can modify the script to organize your pictures by different EXIF tags (e.g. different folders for different digital cameras; see the sketch below);
  • Filenames with spaces are treated correctly;
  • By default mv doesn't overwrite destination files, so any files remaining in the current directory are duplicates and can be deleted.
GPL license v.3, of course.
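
As a sketch of the second note above (a hypothetical variant, not part of BashButler): grouping pictures by camera model instead of shot date could look like this, assuming exiv2's default summary output, which includes a "Camera model" line:

ls -1 *.jpg | while read fn
do
model=`exiv2 "$fn" | grep "Camera model" | cut -d: -f2- | sed 's/^ *//' | tr ' /' '__'`
if test -z "$model"; then model=unknown; fi   # pictures without EXIF data
mkdir -p "./$model"
mv "$fn" "./$model/"
done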

MD5FRACT: partial md5 checksum proposal

Though BitTorrent is nowadays the preferred protocol for sharing and downloading big files such as Linux distribution ISO images, the http:// and ftp:// protocols are definitely still alive and widely used. Although they already have their own integrity checks, different kinds of errors may occur at different levels of the protocol stack, resulting in corrupted data saved on your hard disk. This is where the MD5 and SHA1 algorithms come to help: reasonably fast hash functions useful for checking the integrity of the files we downloaded.

What happens if we detect that a file is corrupted? We have to download it again. The whole file, regardless of its size. While this is not a problem for most high-speed connection users, we shouldn't forget that many people still can't access the internet at high speed (or with a "flat" rate). In any case, for everyone (service providers, high- and low-speed connection users) such a situation leads to a waste of time, money and bandwidth.

What could we do to reduce the negative impact of these common situations?

A cheap and fast solution could be to use partial checksums, as most peer-to-peer protocols already do. If there's only one wrong byte in my file, why should I download the entire file again? Both the http:// and ftp:// protocols support resuming downloads with random access [1], so we can download again just the corrupted part.
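
As an illustration, with HTTP the corrupted part could be re-fetched with a range request and written back into the local file; the URL and byte offsets below are purely hypothetical, and dd's seek is expressed in 1024-byte blocks:

curl -r 20971520-41943039 -o chunk.bin http://mirror.example.org/Gatc.avi
dd if=chunk.bin of=Gatc.avi bs=1024 seek=20480 conv=notrunc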

To achieve this goal, my proposal is a replacement for the common md5sum *nix command that allows partial checksumming.
I wrote a possible replacement (a bash script) which uses dd and md5sum to checksum chunks of a file instead of the entire file. The aim of the script is to be completely compatible with md5sum: the replacement should be almost transparent to system administrators, and should recognize "standard" .md5 files and pass them directly to the standard md5sum utility. The new checksum files would have a .md5f extension, because information about partial checksums can't be compatible with normal .md5 files.

The script version is 0.5, and it's just a demonstration script, not the final version. It can checksum partial chunks of the given file, and check the correctness of a file against a .md5f checksum list. It's open source: you can take it, modify it and redistribute it; just cite me and/or this blog, please. See below for the download link.

The way it works is really simple: dd reads a chunk of the given file and pipes it to md5sum; the hash is written to stdout. Here's an example of how it works:

eu@aer:/d/progetti/md5fract$ ./md5fract.sh Gatc.avi
e54b1acf481f307ae22ac32bbc6ce5df 1:Gatc.avi
a04871e4362c38c1243b2dd165bcfa07 2:Gatc.avi
9f1721ff9bac5facb986cc0964e24a60 3:Gatc.avi
a8230976234a97a4e11465eb1bb850d6 4:Gatc.avi
e8f3ef6ffa56292dfae0dc50f06712b5 5:Gatc.avi
368f6d1ae0106c49b71e2d9c0ab05e96 6:Gatc.avi
66e3d9107c0cf40dbafa09bb6cac38a3 7:Gatc.avi
7f8fd01e16b6900661f8d4aac2cee7f5 8:Gatc.avi
1eb25dc5d3e239ba247c94f1614588fd 9:Gatc.avi
5f96e9f4e7fdf92172a8f36d265a5070 10:Gatc.avi
eu@aer:/d/progetti/md5fract$ ./md5fract.sh Gatc.avi &> Gatc.avi.md5f
eu@aer:/d/progetti/md5fract$ ./md5fract.sh --check Gatc.avi.md5f
Gatc.avi: OK (10)
eu@aer:/d/progetti/md5fract$

If we modify one of the hashes, we get this error message:

eu@aer:/d/progetti/md5fract$ ./md5fract.sh --check Gatc.avi.md5f
Gatc.avi: FAILED
Wrong hash at line 8 for file Gatc.avi:
Chunk: 8 (146800640 ... 167772160)
Calculated hash: 7f8fd01e16b6900661f8d4aac2cee7f5
Found hash: 9f8fd01e16b6900661f8d4aac2cee7f5
eu@aer:/d/progetti/md5fract$
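
For the curious, the core of the idea can be sketched in a few lines of bash (this is not the actual md5fract script; the 20 Mb chunk size is just inferred from the byte offsets in the error message above, and stat -c %s assumes GNU coreutils):

#!/bin/bash
FILE="$1"
CHUNK_KB=20480                  # chunk size in 1024-byte blocks (20 Mb)
SIZE=`stat -c %s "$FILE"`       # file size in bytes
CHUNKS=$(( (SIZE + CHUNK_KB*1024 - 1) / (CHUNK_KB*1024) ))
for i in `seq 1 $CHUNKS`
do
HASH=`dd if="$FILE" bs=1024 skip=$(( (i-1)*CHUNK_KB )) count=$CHUNK_KB 2>/dev/null | md5sum | awk '{print $1}'`
echo "$HASH $i:$FILE"
done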

At first I thought that an executable implementing its own md5sum routine would have performed better, because using dd for each chunk means opening a file handle, seeking inside the file and starting a new md5sum process for every chunk; however, a quick performance comparison shows that, thanks to the fast random file access of modern filesystems, this bash script with its redundant file openings is almost as fast as the traditional md5sum program run on the same file:

eu@aer:/d/progetti/md5fract$ time ./md5fract.sh DevAdvocate.avi
49980c46641915c55252772dc4933090 1:DevAdvocate.avi
7bc34fa302d6fee588eb06421fd529c0 2:DevAdvocate.avi
7c521d571165f4224693396f378e5001 3:DevAdvocate.avi
34975d9be25a75d7e39cc882db9e0ca4 4:DevAdvocate.avi
76e5e56e8972656ef17f1b2c429b3695 5:DevAdvocate.avi
cc32be51ba083482bb4b3e9d2143eb00 6:DevAdvocate.avi
d5ce8ae2027288dfe182276ce2143a69 7:DevAdvocate.avi
5037475202e4ae2682434c612e11b6a8 8:DevAdvocate.avi
0dc569188a14c74c9e6807a05d9af1f6 9:DevAdvocate.avi
a9ecf0210a55e82349939c11d21272b4 10:DevAdvocate.avi

real 1m5.386s
user 0m5.256s
sys 0m6.536s
eu@aer:/d/progetti/md5fract$ time md5sum DevAdvocate.avi
c61c60414ba0042169d0caf0880c2610 DevAdvocate.avi

real 0m56.813s
user 0m5.260s
sys 0m1.912s
eu@aer:/d/progetti/md5fract$

The user time is basically identical, but the system time (needed, I suppose, to open file handles and spawn subprocesses more than once) more than doubles. Anyway, on my machine md5fract always takes around 110%-115% of the time md5sum needs to checksum the same file ("DevAdvocate.avi" is about 1.4 GB). In conclusion, checksumming partial file chunks through a bash script doesn't seem to perform badly. This test was done on an ext3 filesystem, but it would be useful to run more tests on different filesystems (which may have different seek speeds).

The aim for version 1.0 would be to have the same script with:
  • Multiple files support (now it supports just one input file);
  • Full compatibility with the existing md5sum utility;
  • Compatibility with non-bash shells;
  • Better error handling.
If you're tired of downloading the same files again and again, or if you simply like the concept, please spread this idea. I'll complete it sooner or later (sooner if I see that some admins are interested in it), but of course anyone else is welcome to complete it ;). Once it's complete, I'd like to propose it to several mirror services, and I hope someone will adopt it.

In a nutshell:
  • MD5 partial checksum utility (md5fract) for bash, version 0.5
  • Requires md5sum already installed
  • Download link (less than 6K)

[1] Unfortunately, some FTP clients use just a signed 32-bit integer to seek within remote files; as a consequence, they can't seek to offsets beyond 2 GB.