Smoke – A unified means of generating, transmitting, encapsulating, and validating multiple hash digests simultaneously to replace existing stand-alone hash digest software. The software generates digests in parallel and is notably faster than using individual algorithms serially on large files. Smoke operates much the same way as existing hash digest tools, like md5sum, and is designed to be a full replacement.

Impetus

An important component of Information Security is the CIA triad: Confidentiality, Integrity, and Availability. Today, I talk about file integrity, weak hashing, and how it affects some of my clients. The clients using legacy hash validation algorithms like md5 need stronger hashes while maintaining backward compatibility and speeding up hash digest generation, so I created a new mechanism for hash digest management: Smoke.

I always recommend using stronger hashing algorithms since both md5 and sha1 are subject to collisions, even for small files, as demonstrated here, or for valid files in complex formats, such as in these jpeg images. For the large data dumps or the small files of financial directives, an injected collision could have disastrous consequences for my clients and their customers.

My clients distribute sets of files to many of their customers. These files can be quite different: one client sends very large multi-terabyte data files and other clients send smaller files containing transactional commands. For a variety of legacy and mainframey reasons, the downstream customers of my clients only use older hashing algorithms, like md5 and sha1, for validating that the files were correctly transferred. My clients would like to force their customers to upgrade, but these downstream systems do not support the more secure hash validation mechanisms.

For the client who sends large files out, the pseudo-code used in their system is essentially:

cat giant-file | md5 > giant-file.md5
cat giant-file | sha1sum > giant-file.sha1
cp giant-file* /sftp-share-folder

The clients’ customers are finally demanding better hashing algorithms. Unfortunately, downstream Customer A wants to move to sha256 while Customer B desires sha512. Yet Customers C & D still need legacy sha1 and md5 for the foreseeable future, so these need to stay too. A lazy code change to add the two new algorithms might result in something like:

cat giant-file | md5           > giant-file.md5
cat giant-file | shasum        > giant-file.sha1
cat giant-file | shasum -a 256 > giant-file.sha256
cat giant-file | gsha512sum    > giant-file.sha512

But reading a massive file, from multiple gigabytes to a few terabytes in size, four times sequentially from disk is an inefficient means of computing all of these sums. Running the commands in the background, as below, won’t help either:

cat giant-file | md5           > giant-file.md5     &
cat giant-file | shasum        > giant-file.sha1    &
cat giant-file | shasum -a 256 > giant-file.sha256  &
cat giant-file | gsha512sum    > giant-file.sha512  &

That might actually be worse, as an amazing amount of disk head thrash will occur when one of the more complex hash processes falls behind the others and the disk cache gets flushed. Ideally, we want:

while not giant-file.eof()
    data = giant-file.read(1MB buffer)
    md5.update(data)
    sha1.update(data)
    sha256.update(data)
    sha512.update(data)

echo md5.hexdigest    > giant-file.md5
echo sha1.hexdigest   > giant-file.sha1
echo sha256.hexdigest > giant-file.sha256
echo sha512.hexdigest > giant-file.sha512
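The pseudo-code above maps fairly directly onto Python's standard hashlib module. The following is a minimal sketch and not Smoke's actual source; the function name multi_hash and its parameters are assumptions made for illustration, and the demo hashes an in-memory stream rather than a real giant-file.

```python
import hashlib
import io

def multi_hash(stream, algorithms=("md5", "sha1", "sha256", "sha512"),
               buffer_size=1024 * 1024):
    """Read a binary stream once and feed every chunk to all hash objects."""
    hashers = {name: hashlib.new(name) for name in algorithms}
    while True:
        data = stream.read(buffer_size)
        if not data:  # an empty bytes object signals end of stream
            break
        for hasher in hashers.values():
            hasher.update(data)
    return {name: hasher.hexdigest() for name, hasher in hashers.items()}

# Demo on an in-memory stream; for a real file use open("giant-file", "rb").
digests = multi_hash(io.BytesIO(b"t123"))
for name, value in digests.items():
    print(name, value)
```

The file is read exactly once regardless of how many algorithms are requested; only the CPU work grows with each added hash.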

Now we’re talking about a real program. So, I wrote one for these clients.

Existing Software

I looked for a simple, pre-existing tool to generate these multiple hashes. Fancy tee commands are suggested in this article, but it gets really complicated with four hash algorithms and using /proc dependencies. A simple python script for multiple hashes is shown there too, but the script only dumps the hash – there is no validation.

Another bit of software called Quickhash will also generate and compare file hashes for multiple algorithms, but it does not have a command line or API interface. The HashDeep suite does contain a command line and API, but after consideration of HashDeep’s complexity and implementation, it was not used. It is probably possible to extend HashDeep to implement all features of Smoke and remove the need for a Python install, but it would be a complex undertaking.

The Python Snakeoil project does perform threaded hashing using multiple algorithms, but contains no file checksum verification functionality. The code was designed as a compatibility crutch for older Python versions or missing OS utilities. For this and other reasons, I decided not to use this software as a starting point.

Introducing Smoke

As none of the existing multi-hash tools would work, I created a new tool that can generate or validate hashes for files and called it Smoke. In determining operational needs, I examined the existing single-hash command line tools (both OS-based and GNU-based). Running them, we can observe semi-standardized formats for reporting hashes on the cleartext “t123”:

05ec834345cbcf1b86f634f11fd79752bf3b01f3  t123      # OS X : shasum t123
05ec834345cbcf1b86f634f11fd79752bf3b01f3  t123      # GNU  : sha1sum t123
MD5 (t123) = cfd12d74bca9357022eb7d8367bcab26       # OS X : md5 t123
cfd12d74bca9357022eb7d8367bcab26 t123               # OS X : md5 -r t123
cfd12d74bca9357022eb7d8367bcab26  t123              # GNU  : md5sum t123

05ec834345cbcf1b86f634f11fd79752bf3b01f3  -         # OS X : shasum stdin
05ec834345cbcf1b86f634f11fd79752bf3b01f3  -         # GNU  : sha1sum stdin
cfd12d74bca9357022eb7d8367bcab26                    # OS X : md5 stdin
cfd12d74bca9357022eb7d8367bcab26  -                 # GNU  : md5sum stdin

The BSD-based version of md5 used in OS X produces a verbosely formatted “tagged” output that is to be avoided. In fact, both BSD-native md5 and shasum programs produce this tagged format, yet OS X mixes things up by using the BSD md5 and a Perl script for shasum, thus the output formats differ. The GNU and Perl versions of md5 and sha1 are basically “hash whitespace filename”. Sometimes the whitespace is spaces, sometimes it is a tab. There is also an asterisk (*) added in instances where hashes are computed on binary files versus treating files as text (not shown here). Some software uses a dash (-) for standard input, others do not.

From these various existing output formats, I decided to use delimited formatting for Smoke, but with a single tab for the whitespace for easier downstream parsing. I also decided that a binary-only assumption is better and eschewed the asterisk completely. Since Smoke utilizes multiple hashes, the output must contain the hash name. Thus, keeping the same tab delimitation, each line might be:

sha1    05ec834345cbcf1b86f634f11fd79752bf3b01f3    t123
md5 cfd12d74bca9357022eb7d8367bcab26    t123

An argument can be made that algorithms have known bit-lengths, thus you can completely eschew providing the algorithm name for each line. So, 32 hex digits means md5, 96 means sha384. Bad idea, because: 1) a future algorithm may have the same output bit length as an existing one and 2) there are commonly used truncated hashes, like sha-512/256, which happens to be the same bit-length as sha256. Thus, the hash algorithm must be specified.
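The length ambiguity is easy to demonstrate with Python's hashlib. The sketch below uses sha3_256 and blake2s as stand-ins, which is an assumption for illustration: they are not mentioned in the text, but unlike sha-512/256 they are guaranteed to be present in a standard Python 3 install. All three produce 64 hex digits for the same input, yet the values differ.

```python
import hashlib

data = b"t123"
candidates = {
    "sha256": hashlib.sha256(data).hexdigest(),
    "sha3_256": hashlib.sha3_256(data).hexdigest(),
    "blake2s": hashlib.blake2s(data).hexdigest(),  # default digest size is 32 bytes
}
for name, digest in candidates.items():
    print(f"{name:8s} {len(digest)} hex digits  {digest[:12]}...")

# Every digest has the same length, so length alone cannot identify the algorithm.
lengths = {len(digest) for digest in candidates.values()}
distinct_values = len(set(candidates.values()))
```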

After some testing and getting feedback from clients, we determined that the duplication of filenames per line is also unneeded. The tab/newline format assists when a human reads the file, but this is not necessary for a machine. This format also makes stream processing harder if files become sorted; you’d have to consume the whole SUMS file to find all hash digests for a single filename. So, I chose a single-line implementation instead: hash1=val1;hash2=val2 (tab) filename. The previous example would then be like so:

sha1=05ec834345cbcf1b86f634f11fd79752bf3b01f3;md5=cfd12d74bca9357022eb7d8367bcab26  t123

This is slightly less readable for a human, but much nicer to process and conceptualize. One item per line; first part is the multi-hash, second is the filename. Stdin is specified with a dash (-) as the filename.
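Producing such a line takes only a few statements. Here is a small sketch in Python; the helper name smoke_line is hypothetical and this is not taken from Smoke's source.

```python
import hashlib

def smoke_line(digests, filename):
    """Join name=hexdigest pairs with ';', then a single tab, then the filename."""
    pairs = ";".join(f"{name}={value}" for name, value in digests.items())
    return f"{pairs}\t{filename}"

data = b"t123"
digests = {
    "sha1": hashlib.sha1(data).hexdigest(),
    "md5": hashlib.md5(data).hexdigest(),
}
line = smoke_line(digests, "t123")
print(line)
```

For stdin, the same helper would be called with "-" as the filename.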

The Smoke file format will also ignore empty lines and lines starting with “#” as a convenience. I coded my Python parser to be more forgiving by stripping whitespace from the hashes and to deal with spaces versus tabs – but the standard file should always use tabs to separate fields.
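A forgiving reader along those lines might look like the sketch below. It is an illustration under stated assumptions, not the parser shipped with Smoke: the function name is hypothetical, lowercasing of the hex value is an extra leniency not described above, and filenames with leading or trailing whitespace are not handled.

```python
def parse_smoke_line(line):
    """Return (digests, filename), or None for blank lines and '#' comments."""
    stripped = line.strip()
    if not stripped or stripped.startswith("#"):
        return None
    # The standard separator is a tab; fall back to the first run of whitespace.
    if "\t" in stripped:
        parts = stripped.split("\t", 1)
    else:
        parts = stripped.split(None, 1)
    if len(parts) != 2:
        raise ValueError(f"no filename field in line: {line!r}")
    hash_part, filename = parts
    digests = {}
    for pair in hash_part.split(";"):
        if not pair.strip():
            continue
        name, _, value = pair.partition("=")
        digests[name.strip().lower()] = value.strip().lower()
    return digests, filename.strip()

sample = [
    "# a comment line",
    "",
    "sha1=05ec834345cbcf1b86f634f11fd79752bf3b01f3;md5=cfd12d74bca9357022eb7d8367bcab26\tt123",
    "md5=CFD12D74BCA9357022EB7D8367BCAB26   -",  # spaces instead of a tab
]
entries = [entry for entry in map(parse_smoke_line, sample) if entry is not None]
for digests, filename in entries:
    print(filename, sorted(digests))
```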

The hash names are normalized: lowercase name, no dashes. Thus, SHA-1 becomes sha1. I did not define a special case for “sha”, it must be written as “sha1”. “SHA” by itself is too ambiguous.
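The normalization rule fits in a few lines. The helper below is a hypothetical sketch covering only the cases named above (lowercasing, dropping dashes, rejecting a bare “sha”); how other spellings should be treated is left open.

```python
def normalize_hash_name(name):
    """Lowercase the name and drop dashes; reject the ambiguous bare 'sha'."""
    normalized = name.strip().lower().replace("-", "")
    if normalized == "sha":
        raise ValueError("'sha' is ambiguous; write 'sha1' explicitly")
    return normalized

print(normalize_hash_name("SHA-1"))    # sha1
print(normalize_hash_name("MD5"))      # md5
print(normalize_hash_name("SHA-512"))  # sha512
```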

Additionally, the hash is always lowercase hex digits. By using hex, the hash’s length is doubled in size. Base64 would only increase the size by about 33%, but there are too many problems with Base64 and transmission. The special characters / and + get lost in a URL, plus there are the = and == string endings that might get in the way of name=value pair divination. There is the alternate Base64URL encoding which changes those / and + characters and removes the = and == endings in certain conditions. Too many variables for too little gain – thus Smoke’s input & output format requires hex digits.
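To make the size trade-off concrete, here is a short comparison of the encodings for one sha512 digest, using only Python's standard library. It is purely illustrative; Smoke itself only reads and writes the hex form.

```python
import base64
import hashlib

digest = hashlib.sha512(b"t123").digest()     # 64 raw bytes
hex_text = digest.hex()                       # 128 characters
b64_text = base64.b64encode(digest).decode()  # 88 characters, padded with "=="
# URL-safe alphabet with the padding stripped: 86 characters
b64url_text = base64.urlsafe_b64encode(digest).decode().rstrip("=")

for label, text in (("hex", hex_text), ("base64", b64_text), ("base64url", b64url_text)):
    print(f"{label:10s} {len(text):4d} chars")
```

The padded Base64 form ends in ==, which is exactly the character that would collide with the name=value syntax of a smoke line.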

Catching Collisions

Does Combining MD5 plus SHA1 Create A More Awesomely Secure Hash? Nope. There is a wonderful PhD dissertation by Anja Lehmann which goes into great depth as to why the combination of two different hashes is not significantly better. Some less technical explanations are offered here.

Smoke’s combining of the hashes, if anything, provides some bit of future-proofing for when (not if) a hash algorithm is deemed “broken”. Thus, if techniques to produce reasonable collisions in O(1) time for sha1 are created, the Smoked Hash will still contain the other “safe” hashes which are validated simultaneously when sha1 is computed.

Here is an example which shows how Smoke will catch a hash collision in one algorithm. Using the 128-byte md5 collision file created by Xiaoyun Wang and Hongbo Yu, here is a run of md5 and smoke in checksum mode:

$ cp collisions/md5-collide-? .
$ gmd5sum md5-collide-1 > md5-collide.md5sums  ;# compute hash on file-1 
$ cat md5-collide.md5sums
79054025255fb1a26e4bc422aef54eb4 md5-collide-1

$ mv md5-collide-2 md5-collide-1       ;# overwrite file-1 with falsified data
$ gmd5sum -c md5-collide.md5sums       ;# perform hash digest validation
md5-collide-1: OK

$ cp collisions/md5-collide-? .        ;# restore original files
$ diff -q md5-collide-1 md5-collide-2
Files md5-collide-1 and md5-collide-2 differ

$ ./smoke md5-collide-1 > md5-collide.smoke
$ cat md5-collide.smoke
sha1=a34473cf767c6108a5751a20971f1fdfba97690a;sha512=9272889ad0f7372047229d19ca58b93c539002f21f6e3da1697514406fe9b3cb20fcf546b9005ebadc16691a71658af848e35abad58422d5ae4650c21d7ad749;md5=79054025255fb1a26e4bc422aef54eb4  md5-collide-1

$ mv md5-collide-2 md5-collide-1       ;# overwrite file-1 with falsified data
$ ./smoke -c md5-collide.smoke         ;# perform hash digest validation
md5-collide-1: FAILED
#WARN: DIFFS {'sha1': ('428...' , a34...') , 'sha512': ('771...', '927...')}
#INFO: hashes matching:  {'md5': 'md5'}

As can be seen above, smoke operates just like existing hash software using -c as a checksum function, but smoke will perform the checksum operation using all hashes provided. There are two debugging lines, #INFO and #WARN, which show the validations used; these can be turned off. The software’s normal output is that the digest checksum has failed for the file, exactly as we want.
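The idea behind the check can be sketched in Python as follows. This is not Smoke's code: check_digests is a hypothetical name, and because real collision files cannot be embedded here, the md5 collision is simulated by overwriting the recorded md5 with the tampered payload's md5.

```python
import hashlib

def check_digests(data, expected):
    """Recompute every expected digest over data; return (matching, differing)."""
    matching, differing = [], {}
    for name, want in expected.items():
        got = hashlib.new(name, data).hexdigest()
        if got == want.lower():
            matching.append(name)
        else:
            differing[name] = (got, want)
    return matching, differing

original = b"original payload"
tampered = b"tampered payload"
expected = {n: hashlib.new(n, original).hexdigest() for n in ("md5", "sha1", "sha512")}

# Simulated collision (assumption): pretend the tampered file shares the md5.
expected_after_attack = dict(expected, md5=hashlib.md5(tampered).hexdigest())

matching, differing = check_digests(tampered, expected_after_attack)
status = "OK" if not differing else "FAILED"
print(status, "matching:", matching, "differing:", sorted(differing))
```

A file passes only when every recorded digest matches, so a collision in one algorithm is not enough to slip through.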

And even with all of these extra hashes, the software is nearly as fast as using a single hash generator.

Speed and More Speed

When generating hashes serially, hashing is very slow. Smoke tries to optimize the process by performing the slowest part of hashing only once: the disk read. Even with SSD drives, disk I/O is still slower than CPU computation. Smoke also does a second speed up: use multiple CPU cores, one per algorithm.
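Both optimizations, a single read and one worker per algorithm, can be pictured with the sketch below. It is one possible arrangement and not Smoke's implementation: the function names, the bounded queue depth of 4, and the use of a None sentinel are all assumptions. CPython's hashlib releases the global interpreter lock while hashing larger buffers, which is what lets the threads use separate cores.

```python
import hashlib
import io
import queue
import threading

def _worker(hasher, chunks):
    """Consume chunks from a queue until the None sentinel arrives."""
    while True:
        data = chunks.get()
        if data is None:
            break
        hasher.update(data)

def threaded_multi_hash(stream, algorithms=("md5", "sha1", "sha512"),
                        buffer_size=1024 * 1024):
    hashers = {name: hashlib.new(name) for name in algorithms}
    # Bounded queues stop the reader from racing far ahead of a slow algorithm.
    queues = {name: queue.Queue(maxsize=4) for name in algorithms}
    threads = [threading.Thread(target=_worker, args=(hashers[n], queues[n]))
               for n in algorithms]
    for thread in threads:
        thread.start()
    while True:
        data = stream.read(buffer_size)  # the single read shared by all hashes
        for q in queues.values():
            q.put(data if data else None)  # None tells the worker to stop
        if not data:
            break
    for thread in threads:
        thread.join()
    return {name: hasher.hexdigest() for name, hasher in hashers.items()}

print(threaded_multi_hash(io.BytesIO(b"t123")))
```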

Here are some metrics for a 24GB file, a size large enough to remove the disk and memory caching factors. This test was performed using a five-year-old MacBook Pro with 8GB RAM and a 5400rpm drive. The average times for five runs of OS X and GNU software versus Smoke:

	real	user	sys
md5/osx	61.83	58.53	12.34
sha1/osx	76.03	60.78	10.12
sha512/osx	124.85	111.67	10.86
md5/gnu	66.78	53.16	8.27
sha1/gnu	66.09	51.46	8.45
sha512/gnu	106.69	94.36	8.52
smoke	74.58	133.07	14.25

If you’ve never seen the real/user/sys notation, Real time is time passed in the real world (e.g., look at a wall clock). User time is how long your program runs across all CPU cores. Sys time is how much time the OS spends loading files, context switching your threads, etc.

As my implementation of smoke is multi-threaded, the user time is higher than with the other algorithms, but not by very much. Remember that Smoke has generated all three results. Thus, to get a real comparison with the other tools, you need to combine their sha1/md5/sha512 times together. So:

	real	user	sys
sum of osx	262.70	230.98	33.31
sum of gnu	239.57	198.99	25.25
smoke speedup vs osx	3.52 ×	1.73 ×	2.33 ×
smoke speedup vs gnu	3.21 ×	1.49 ×	1.77 ×

From this standpoint, Smoke’s real world wall clock time is 3.2-3.5 times faster overall than the other programs run individually.
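The speedup figures are simply the summed times divided by Smoke's times. The short calculation below reproduces that arithmetic from the published numbers; the last digit can differ from the table by one depending on whether values are rounded or truncated.

```python
smoke = {"real": 74.58, "user": 133.07, "sys": 14.25}
summed = {
    "osx": {"real": 262.70, "user": 230.98, "sys": 33.31},
    "gnu": {"real": 239.57, "user": 198.99, "sys": 25.25},
}

# speedup = (sum of the three single-hash runs) / (one smoke run)
speedup = {
    impl: {key: times[key] / smoke[key] for key in smoke}
    for impl, times in summed.items()
}
for impl, ratios in speedup.items():
    print(impl, " ".join(f"{key}={value:.2f}x" for key, value in ratios.items()))
```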

The test harness shown below was run via ./time-test.sh >> outname-num.txt 2>&1 five times and the real/user/sys lines extracted into a spreadsheet. The averages for each software+algorithm were published above.

test_file="$HOME/z-No-Backup/Big-File.7z"    ; # 24GB

# Time Command
tc="/usr/bin/time -l"

$tc md5 -r $test_file
echo
$tc shasum -b $test_file
echo
$tc shasum -b -a 512 $test_file
echo
$tc gmd5sum -b $test_file
echo
$tc gsha1sum -b $test_file
echo
$tc gsha512sum -b $test_file
echo
$tc ./smoke $test_file

During testing, the verbose version of /usr/bin/time was used, producing extra information such as memory sizes, page faults, file blocking statistics, etc. There were differences in each piece of software, but the one that stands out was the “maximum resident set size” – the RAM used.

alg/impl	bytes	MB
md5/osx	2,406,634,291	2,295.1 MB
sha1/osx	4,265,574	4.1 MB
sha512/osx	4,294,246	4.1 MB
md5/gnu	861,798	0.8 MB
sha1/gnu	847,872	0.8 MB
sha512/gnu	864,256	0.8 MB
smoke	9,601,024	9.2 MB

The OS X / BSD version of md5 tried to map the whole 24-ish GB file into RAM, perhaps using C’s mmap(); the others did not. Smoke’s memory footprint was around 9 MB. Reducing the memory cache buffer by half to 0.5 MB reduced the memory footprint by 0.5 MB and increased the runtime by 5-6 seconds. There is probably a happy middle ground that could be determined for speed/size trade-off. Smoke does have a command-line option to reduce the memory buffer for low memory situations. However, the Python+OpenSSL overhead does present a limit to the memory savings for this Smoke implementation.

Hashing in Smoke is done using Python’s default hashlib implementation. Each hashing algorithm could be made a little faster if a different underlying crypto library were used. In Timo Bingmann’s report, certain libraries are definitely faster than others with the same algorithm. I couldn’t find more recent speed tests for new versions of OpenSSL (used by Python) versus, say, Apple’s CoreCrypto library. But, the library used really does not matter — especially compared to disk read time and threading. A hardware implementation of the hashing algorithms should best even the fastest of libraries.

Even with threading, the speed of Smoke is still limited by the slowest algorithm. There are diminishing returns on additional parallelization of Smoke.

Algorithms Supported

So far, this paper has mentioned md5, sha1, sha256, and sha512. But, the smoke file format is algorithm agnostic. Thus, any hashing digest is supported, including md4, ripemd160, whirlpool, etc. Since this project was coded in Python, Smoke automatically inherits all digest algorithms that Python, né OpenSSL, supports. On MacOS High Sierra, this is a large set including many algorithms that I’ve only heard of at Ballmer Peak infosec events.
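What a given Python install supports can be listed with two attributes of the standard hashlib module. The output varies by platform and OpenSSL build, so no particular list is implied here.

```python
import hashlib

# Always present in this Python version, regardless of platform.
guaranteed = sorted(hashlib.algorithms_guaranteed)
# Everything usable in this interpreter, including what OpenSSL adds.
available = sorted(hashlib.algorithms_available)
extra = sorted(set(available) - set(guaranteed))

print("guaranteed:", ", ".join(guaranteed))
print("added by the local OpenSSL build:", ", ".join(extra) or "(none)")
```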

In general, a smoked hash digest should include a minimal set of algorithms: sha1, md5, and sha512 for maximal support. Including additional algorithms does not negatively impact downstream systems. If a downstream system performing a hash verification does not support streebog512 as provided by the system creating the smoke digest, then the downstream system will simply consume the other digests provided, e.g., sha1 & sha512.
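On the consuming side, that behaviour amounts to filtering the parsed digests by what the local system can compute. A hypothetical sketch follows; the function name, the downstream capability set, and the placeholder streebog512 value are all assumptions for illustration.

```python
import hashlib

def usable_digests(digests, supported):
    """Split a parsed smoke entry into verifiable digests and skipped names."""
    usable = {name: value for name, value in digests.items() if name in supported}
    skipped = sorted(name for name in digests if name not in supported)
    return usable, skipped

data = b"t123"
entry = {
    "sha1": hashlib.sha1(data).hexdigest(),
    "sha512": hashlib.sha512(data).hexdigest(),
    "streebog512": "0" * 128,  # placeholder value, not a real streebog digest
}
downstream_supports = {"md5", "sha1", "sha256", "sha512"}  # hypothetical system

usable, skipped = usable_digests(entry, downstream_supports)
print("verify with:", sorted(usable))
print("ignored:", skipped)
```

A cautious verifier would probably treat an entry with no usable digests at all as a failure rather than a pass.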

Output and Compatibility

Since one of my clients has downstream customers that want a single SUMS file and other customers want a different hash digest file per file on the disk, I added options to generate all of these scenarios at the same time. So, smoke can generate filename.md5 and filename.sha1 along with SUMS.smoke and SUMS.md5. I made output flexible to minimize the hash generation time.

So, for a download structure like Ubuntu uses, Smoke can generate all of the SUMS files in a single optimized run.
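Because all digests for a file are computed in one pass, fanning them out into several SUMS files is just a formatting step. The sketch below builds the file contents in memory; the function name, the single-tab separator in the per-algorithm files, and the ALGSUMS naming pattern are assumptions modeled on the MD5SUMS and SHA1SUMS names mentioned in this article.

```python
import hashlib

def sums_files(results, smoke_name="SMOKESUMS"):
    """results maps filename -> {algorithm: hexdigest}; returns {output name: text}."""
    outputs = {smoke_name: ""}
    for filename, digests in results.items():
        pairs = ";".join(f"{alg}={value}" for alg, value in digests.items())
        outputs[smoke_name] += f"{pairs}\t{filename}\n"
        for alg, value in digests.items():
            name = f"{alg.upper()}SUMS"  # e.g. MD5SUMS, SHA1SUMS
            outputs[name] = outputs.get(name, "") + f"{value}\t{filename}\n"
    return outputs

# Two in-memory payloads stand in for files; each "file" contains its own name.
results = {
    name: {alg: hashlib.new(alg, name.encode()).hexdigest() for alg in ("md5", "sha1")}
    for name in ("t123", "t456")
}
outputs = sums_files(results)
for name in sorted(outputs):
    print(f"==> {name} <==")
    print(outputs[name])
```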

Smoke can only validate a checksum using a SUMS.smoke file. I did not implement logic for checking filename.smoke or for the other singular hashes. Perhaps a future endeavor for someone else.

I also did not create bindings for other languages. The beauty of smoke is the generic file format, not the Python script which I coded that generates the hashes. Thus, it is relatively simple to create the smoke output format in whichever language desired. Just make the output format hash1=val1;hash2=val2 (tab) filename (newline) and add a tiny bit of optional threading.

Implementation

Smoke is a concept: many hashes combined into a single, simple transmission format. The implementation is a command line utility called smoke and it is a feature-complete hash digest generator and validator. Here is the help execution:

$ ./smoke --help
usage: smoke [-h] [--stdout] [--smoke-file]
                [--smoke-file-name SMOKE_FILE_NAME] [--digest-per-file]
                [--multiple-smokes] [--multiple-sums-digests]
                [--hash-hashed-files] [--ignore-unknown-algs] [-c CHECKFILE]
                [--debug] [--verbose] [--show-algs] [--show-defaults] [-O]
                [-a USE_ALGS] [--version] [--print-config]
                [files [files ...]]

Smoke - A unified means of generating and validating hash digests. Author: Jay
Ball @veggiespam. Command line arguments are beta and subject to change.

optional arguments:
  -h, --help            show this help message and exit

Hash generation destinations:
  Results are sent to stdout by default; can send to specified multiple
  destinations, both files and stdout

  --stdout              Output smoked hash for all files to stdout
  --smoke-file          Save smoked hash to single sums file, SMOKESUMS
  --smoke-file-name SMOKE_FILE_NAME
                        Name of smoked hash file, default SMOKESUMS
  --digest-per-file     Output digests per file, filename.md5, filename.sha1,
                        etc
  --multiple-smokes     Output a smoke for each file, filename1.smoke,
                        filename2.smoke
  --multiple-sums-digests
                        Output multiple digest summaries per algorithm,
                        SHA1SUMS, MD5SUMS, etc.
  --hash-hashed-files   Normally, SMOKESUMS, f.smoke, MD5SUMS, f.md5, f.sha1,
                        etc are ignored; this hashes them anyway

Hash validation:
  --ignore-unknown-algs
                        ignore unknown algs in SUMS file or command line
  -c CHECKFILE, --check CHECKFILE
                        check file to validate against or "-" for stdin
  --debug               debug info to stderr
  --verbose             verbose info to stderr

Options for both:
  --show-algs           Show all supported algorithms and exit
  --show-defaults       Show default used algorithms and exit
  -O, --use-only-algs   Only use algs specified with --use-algs, do not append
                        defaults
  -a USE_ALGS, --use-algs USE_ALGS
                        Algorithms to use, appends to defaults unless --use-
                        only-algs is present
  --version             show program's version number and exit
  --print-config        Print debug configuration and exit
  files                 Files to smoke

For our sample runs, we use two 4-byte files t123 and t456 that contain “t123” and “t456” respectively; there is no newline at the end of either data file.

First, let’s start with the operational basics, such as how to create a hash digest from stdin:

$ ./smoke  < t123
sha1=05ec834345cbcf1b86f634f11fd79752bf3b01f3;sha512=5c7b8e44d46c535ee4c0caedde5cb4e4dc70826e274ab63f49c4f036e9e337a4b6e4de5a874fe5a2962dc7e603308edbcbd3494ac7ceabdecad057f6596aac4c;md5=cfd12d74bca9357022eb7d8367bcab26  -

That looks like any other digest software; the hashes followed by a tab, then by a dash (-). Next, get a digest from two input files:

$ ./smoke t123 t456
sha1=05ec834345cbcf1b86f634f11fd79752bf3b01f3;sha512=5c7b8e44d46c535ee4c0caedde5cb4e4dc70826e274ab63f49c4f036e9e337a4b6e4de5a874fe5a2962dc7e603308edbcbd3494ac7ceabdecad057f6596aac4c;md5=cfd12d74bca9357022eb7d8367bcab26  t123
sha1=c632f2ea2a88f9778276bdc6830f04be67695464;sha512=d9e46b597862a7ecb9489304c0b5b27ef5ce38ca6c0c9193cbdd6cdb888b5fdde9395d54d746051f5010490910fceb0c1dc4e8e0ce2c5b2b9f0a32f9d589c923;md5=1dbdd8f9093b0a0ea51f2a27a2b0b8b3  t456

Same thing, one line per file, with the hashes first, the tab, and then the name of the file. Earlier, Ubuntu’s download folder was mentioned as being easily computable, with its digest files MD5SUMS, SHA1SUMS, and SHA256SUMS. How would that look on the command line?

./smoke --multiple-sums-digests --use-only-algs \
        --use-algs=md5,sha1 --use-algs=sha256 --smoke-file t123 t456

The option --multiple-sums-digests produces the “SUMS” collection of files. Since the default is to use sha512 and not sha256, the option --use-only-algs turns off the defaults and the --use-algs and -a flags start building up the hashes you wish to use. As can be seen, you can separate the algorithms with commas or use multiple command line entries. Finally, the --smoke-file produces the combined SMOKESUMS file, since we are going for the future of hashing here. After running this command line, here is what we get:

$ ls -lg
total 64
-rwxr-xr-x  1 staff  16399 Jan  1 00:01 smoke
-rw-r--r--  1 staff      4 Jan 10 08:23 t123
-rw-r--r--  1 staff      4 Jan 10 08:23 t456

# Note: we are mixing long and short command line options!
$ ./smoke --multiple-sums-digests --use-only-algs --use-algs=md5,sha1 -a sha256 --smoke-file t123 t456
$ ls -lg
total 128
-rw-r--r--  1 staff     76 Jan 15 09:25 MD5SUMS
-rw-r--r--  1 staff     92 Jan 15 09:25 SHA1SUMS
-rw-r--r--  1 staff    140 Jan 15 09:25 SHA256SUMS
-rw-r--r--  1 staff    320 Jan 15 09:25 SMOKESUMS
-rwxr-xr-x  1 staff  16399 Jan  1 00:01 smoke
-rw-r--r--  1 staff      4 Jan 10 08:23 t123
-rw-r--r--  1 staff      4 Jan 10 08:23 t456

$ head *SUMS
==> MD5SUMS <==
cfd12d74bca9357022eb7d8367bcab26    t123
1dbdd8f9093b0a0ea51f2a27a2b0b8b3    t456

==> SHA1SUMS <==
05ec834345cbcf1b86f634f11fd79752bf3b01f3    t123
c632f2ea2a88f9778276bdc6830f04be67695464    t456

==> SHA256SUMS <==
f6b6d0d62eb661c6d3fd7e35e972a8ed44b4aa2fd6c87b449b82b1b7b1a2319f    t123
7837f643a7b8f50f921383810e7971b4e6283d434b357a594f9358372a909bfd    t456

==> SMOKESUMS <==
sha256=f6b6d0d62eb661c6d3fd7e35e972a8ed44b4aa2fd6c87b449b82b1b7b1a2319f;md5=cfd12d74bca9357022eb7d8367bcab26;sha1=05ec834345cbcf1b86f634f11fd79752bf3b01f3  t123
sha256=7837f643a7b8f50f921383810e7971b4e6283d434b357a594f9358372a909bfd;md5=1dbdd8f9093b0a0ea51f2a27a2b0b8b3;sha1=c632f2ea2a88f9778276bdc6830f04be67695464  t456

If the command line ./smoke -a sha256 --multiple-sums-digests --smoke-file t123 t456 was used instead, then the default algorithms (sha1 md5 sha512) would have been used plus the additionally specified sha256. As such, there would also be a SHA512SUMS file generated.
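The interplay between the defaults, --use-algs, and --use-only-algs can be summarized in a few lines of Python. This is a sketch of the behaviour described above, not the tool's argument handling; the function name and the ordering of the resulting list are assumptions.

```python
DEFAULT_ALGS = ["sha1", "md5", "sha512"]  # the defaults named in the text

def resolve_algorithms(use_algs, use_only=False):
    """use_algs: list of -a/--use-algs values, each possibly comma-separated."""
    requested = [
        alg.strip().lower()
        for value in use_algs
        for alg in value.split(",")
        if alg.strip()
    ]
    chosen = [] if use_only else list(DEFAULT_ALGS)
    for alg in requested:
        if alg not in chosen:  # avoid listing an algorithm twice
            chosen.append(alg)
    return chosen

# --use-only-algs --use-algs=md5,sha1 -a sha256
print(resolve_algorithms(["md5,sha1", "sha256"], use_only=True))
# -a sha256 (defaults kept, sha256 appended)
print(resolve_algorithms(["sha256"]))
```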

The other command usage pattern is a single digest per file, such as fileA.md5 / fileB.md5. If someone wants this single digest file for two specific algorithms, this command could be used:

$ ./smoke -O -a whirlpool -a md5 --multiple-smokes --digest-per-file   t123 t456

$ ls -lg
total 200
-rwxr-xr-x  1 staff  16399 Jan  1 00:01 smoke
-rw-r--r--  1 staff      4 Jan 10 08:23 t123
-rw-r--r--  1 staff     33 Jan 15 10:55 t123.md5
-rw-r--r--  1 staff    181 Jan 15 10:55 t123.smoke
-rw-r--r--  1 staff    129 Jan 15 10:55 t123.whirlpool
-rw-r--r--  1 staff      4 Jan 10 08:23 t456
-rw-r--r--  1 staff     33 Jan 15 10:55 t456.md5
-rw-r--r--  1 staff    181 Jan 15 10:55 t456.smoke
-rw-r--r--  1 staff    129 Jan 15 10:55 t456.whirlpool

$ head t???.*
==> t123.md5 <==
cfd12d74bca9357022eb7d8367bcab26

==> t123.smoke <==
whirlpool=e308efd94ab1810cfe44ea4b368f050b260ffc49c6f47a7ef8d58533a70e8e4bdfb0ff983f883ed2bc8dc08dad2e545e1cdf7da9ac4b400bd45bdf439a09fd0a;md5=cfd12d74bca9357022eb7d8367bcab26 t123

==> t123.whirlpool <==
e308efd94ab1810cfe44ea4b368f050b260ffc49c6f47a7ef8d58533a70e8e4bdfb0ff983f883ed2bc8dc08dad2e545e1cdf7da9ac4b400bd45bdf439a09fd0a

==> t456.md5 <==
1dbdd8f9093b0a0ea51f2a27a2b0b8b3

==> t456.smoke <==
whirlpool=5aa927dbaebfb0a1bdfe76eeee0863404647f2a491f45685111c21a4d83563bfb231befaf9f969f1ac175d4200baad20ee11e6ea5835b9b850a3859c19db0303;md5=1dbdd8f9093b0a0ea51f2a27a2b0b8b3 t456

==> t456.whirlpool <==
5aa927dbaebfb0a1bdfe76eeee0863404647f2a491f45685111c21a4d83563bfb231befaf9f969f1ac175d4200baad20ee11e6ea5835b9b850a3859c19db0303
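
The .smoke lines above follow a simple pattern: semicolon-separated algorithm=digest pairs, a space, then the filename. As a rough illustration of how a consumer might read and check such a line, here is a minimal Python sketch. The function names parse_smoke_line and verify_smoke_line are hypothetical and are not taken from the Smoke code base; the sketch also assumes fixed-length digests and quietly skips any algorithm that the local hashlib build does not provide.

```python
import hashlib
import os


def parse_smoke_line(line):
    """Split 'algo=hex;algo=hex filename' into ({algo: hex}, filename)."""
    # The digest field contains no spaces, so the first space ends it;
    # everything after that space is treated as the filename.
    digest_field, filename = line.rstrip("\r\n").split(" ", 1)
    digests = {}
    for pair in digest_field.split(";"):
        algorithm, _, hexdigest = pair.partition("=")
        digests[algorithm] = hexdigest.lower()
    return digests, filename


def verify_smoke_line(line, directory="."):
    """Recompute each supported digest and return {algorithm: matched}."""
    expected, filename = parse_smoke_line(line)
    # Assumption: algorithms not listed by hashlib are skipped,
    # not reported as failures.
    hashers = {
        name: hashlib.new(name)
        for name in expected
        if name in hashlib.algorithms_available
    }
    # One pass over the file feeds every hash object.
    with open(os.path.join(directory, filename), "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            for hasher in hashers.values():
                hasher.update(chunk)
    return {
        name: hasher.hexdigest() == expected[name]
        for name, hasher in hashers.items()
    }
```

Calling parse_smoke_line on the t123.smoke line shown above would return a dictionary with whirlpool and md5 keys together with the filename t123.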

To generate both the individual filename.hashtype and the HASHSUMS files (aka “Kitchen Sink”), you might do something like:

$ ./smoke -O -a streebog512 -a ripemd160 --multiple-sums-digests --multiple-smokes --digest-per-file --smoke-file  t123 t456

$ ls -lg
total 288
-rw-r--r--  1 staff     92 Jan 15 12:52 RIPEMD160SUMS
-rw-r--r--  1 staff    394 Jan 15 12:52 SMOKESUMS
-rw-r--r--  1 staff    268 Jan 15 12:52 STREEBOG512SUMS
-rwxr-xr-x  1 staff  16399 Jan  1 00:01 smoke
-rw-r--r--  1 staff      4 Jan 10 08:23 t123
-rw-r--r--  1 staff     41 Jan 15 12:52 t123.ripemd160
-rw-r--r--  1 staff    197 Jan 15 12:52 t123.smoke
-rw-r--r--  1 staff    129 Jan 15 12:52 t123.streebog512
-rw-r--r--  1 staff      4 Jan 10 08:23 t456
-rw-r--r--  1 staff     41 Jan 15 12:52 t456.ripemd160
-rw-r--r--  1 staff    197 Jan 15 12:52 t456.smoke
-rw-r--r--  1 staff    129 Jan 15 12:52 t456.streebog512

$ head *SUMS t???.*
==> RIPEMD160SUMS <==
22150c08e4d0431bed36e60b0436c6078235c669    t123
ed80f0f02c441c6c408066885e1f114eaada6b9e    t456

==> SMOKESUMS <==
streebog512=f42f9820d136832079514096a7a538b037829308daa638a527c7d477bd67a07bc850fbafe47cd3ec2135b211691ba79bef442d1d41cb0f9fdee5ca69f482cc9f;ripemd160=22150c08e4d0431bed36e60b0436c6078235c669 t123
streebog512=d480fad9f4d36ec9102428d2183ad93d42b92c2db6be9f616d98ba3f175eb96d30bb7ec7abf19b2cbc40b69afcafc80f819cd80f7b2a8ba9f3900f8587023939;ripemd160=ed80f0f02c441c6c408066885e1f114eaada6b9e t456

==> STREEBOG512SUMS <==
f42f9820d136832079514096a7a538b037829308daa638a527c7d477bd67a07bc850fbafe47cd3ec2135b211691ba79bef442d1d41cb0f9fdee5ca69f482cc9f    t123
d480fad9f4d36ec9102428d2183ad93d42b92c2db6be9f616d98ba3f175eb96d30bb7ec7abf19b2cbc40b69afcafc80f819cd80f7b2a8ba9f3900f8587023939    t456

==> t123.ripemd160 <==
22150c08e4d0431bed36e60b0436c6078235c669

==> t123.smoke <==
streebog512=f42f9820d136832079514096a7a538b037829308daa638a527c7d477bd67a07bc850fbafe47cd3ec2135b211691ba79bef442d1d41cb0f9fdee5ca69f482cc9f;ripemd160=22150c08e4d0431bed36e60b0436c6078235c669 t123

==> t123.streebog512 <==
f42f9820d136832079514096a7a538b037829308daa638a527c7d477bd67a07bc850fbafe47cd3ec2135b211691ba79bef442d1d41cb0f9fdee5ca69f482cc9f

==> t456.ripemd160 <==
ed80f0f02c441c6c408066885e1f114eaada6b9e

==> t456.smoke <==
streebog512=d480fad9f4d36ec9102428d2183ad93d42b92c2db6be9f616d98ba3f175eb96d30bb7ec7abf19b2cbc40b69afcafc80f819cd80f7b2a8ba9f3900f8587023939;ripemd160=ed80f0f02c441c6c408066885e1f114eaada6b9e t456

==> t456.streebog512 <==
d480fad9f4d36ec9102428d2183ad93d42b92c2db6be9f616d98ba3f175eb96d30bb7ec7abf19b2cbc40b69afcafc80f819cd80f7b2a8ba9f3900f8587023939

You can mix and match the options to generate the output you require.
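
To make the relationship between these outputs concrete, here is a small Python sketch that reads a file once, computes several digests from that single pass, and formats them as a smoke-style line and as per-file digest names. It is only an illustration under stated assumptions: the function names are hypothetical, it uses only algorithms that hashlib ships everywhere, and it does not attempt the threading that Smoke itself uses.

```python
import hashlib


def multi_digest(path, algorithms=("sha256", "md5", "sha1"), chunk_size=1 << 20):
    """Read the file once, updating one hash object per algorithm."""
    hashers = {name: hashlib.new(name) for name in algorithms}
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            for hasher in hashers.values():
                hasher.update(chunk)
    return {name: hasher.hexdigest() for name, hasher in hashers.items()}


def smoke_line(digests, filename):
    """Format digests like the SMOKESUMS examples: a=hex;b=hex filename."""
    pairs = ";".join(f"{name}={value}" for name, value in digests.items())
    return pairs + " " + filename


def per_file_digests(digests, filename):
    """Map 'filename.algo' to the bare hex digest, like t123.md5 above."""
    return {f"{filename}.{name}": value for name, value in digests.items()}
```

Because every output format is derived from the same dictionary of digests, adding another output style costs no extra disk reads.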

Code

All code has been published under the Apache License on GitHub. The code’s quality is, frankly, bad – I’m not a Python expert. The code runs and does what it needs to do, but not in the “Python way”. Someone more talented than I may need to help make the code pretty, readable, and expandable. Feel free to submit bug fixes, pull requests, etc. Some future ideas:

Add generation of CRC32 and similar checksums, as those are extremely useful. Each would simply be another algorithm.
Get ideas for the best “short flags” on the command line. While --multiple-sums-digests is needed, maybe -M is better. These should be thought through before they are created and the command line flags are set in stone.
The threading in Python is lazy – it could be made faster if threads were reused between files.
Standardized API or bindings for other languages. There is a Python class, but it is probably very non-Python-like.
Add support for validating non-smoke checksum files, like MD5SUMS or filename.sha1.
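
As a rough idea of what the last item could look like, the Python sketch below validates the two legacy layouts shown earlier: a HASHSUMS file and a filename.hashtype file. Everything in it is an assumption for illustration only. The function names are hypothetical, the algorithm is inferred purely from the file name (MD5SUMS means md5, t123.sha1 means sha1), and only algorithms available in the local hashlib build will work.

```python
import hashlib
import os
import re

# A hex digest, whitespace, an optional '*' binary marker, then the filename.
SUMS_LINE = re.compile(r"^([0-9a-fA-F]+)\s+\*?(.+)$")


def _hexdigest_of(path, algorithm):
    """Hash one file in chunks and return the lowercase hex digest."""
    hasher = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def algorithm_from_sums_name(sums_path):
    """'MD5SUMS' -> 'md5', 'SHA256SUMS' -> 'sha256' (assumed naming rule)."""
    name = os.path.basename(sums_path)
    if not name.upper().endswith("SUMS") or len(name) <= 4:
        raise ValueError(f"cannot infer an algorithm from {name!r}")
    return name[:-4].lower()


def check_sums_file(sums_path):
    """Validate every entry of a HASHSUMS file; return {filename: matched}."""
    algorithm = algorithm_from_sums_name(sums_path)
    base = os.path.dirname(sums_path)
    results = {}
    with open(sums_path, encoding="utf-8") as handle:
        for line in handle:
            match = SUMS_LINE.match(line.rstrip("\r\n"))
            if not match:
                continue  # ignore blank or malformed lines
            expected, filename = match.groups()
            actual = _hexdigest_of(os.path.join(base, filename), algorithm)
            results[filename] = actual == expected.lower()
    return results


def check_digest_file(digest_path):
    """Validate 't123.md5' against 't123', taking the algorithm from the extension."""
    target, extension = os.path.splitext(digest_path)
    with open(digest_path, encoding="utf-8") as handle:
        expected = handle.read().split()[0].lower()
    return _hexdigest_of(target, extension.lstrip(".")) == expected
```

A real implementation would probably let the caller override the inferred algorithm, since legacy files are not always named this consistently.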

Conclusions

Smoke aims to make hash generation quicker (lower disk I/O & parallelization), easier (one command line call to rule them all), more flexible (file format with multiple algorithms), and expandable (legacy support & future algorithms). I designed Smoke to be flexible to the needs of my clients, and my Python implementation does all that they require and more. Eventually, I’d like to see Smoke become part of the standard Unixy tool-set, alongside the existing hashing tools in /usr/bin or even /sbin.

Etymology

Why name this Smoke? The hash command on Unix was already taken. Simply put, hashes are smoked. A list of hashes is a smoke stack. Transmission of them is done via smoke signal. I’m sure there are more puns to be had.

Author

Jay is an NYC #infosec professional who goes by @veggiespam on Twitter, GitHub, LinkedIn, and other networks, and occasionally writes articles on his personal site.

Feedback on this article is appreciated; contact him via direct message on social media.
