I'm building a URL-shortening tool. For an arbitrary link, I need to produce a fixed-length slug that will index the full URL in a database. For aesthetic reasons, I'd like to keep the slug reasonably short (8 alphanumeric characters seems reasonable).
It seems obvious to me that hashing the URL and encoding the digest as hex is an easy way to generate such a slug without having to worry about collisions. I'm not expecting more than a million links to ever be posted (it's not a public tool), so I don't need anything like the collision resistance a normal hashing algorithm provides. Unfortunately, hash values tend to be rather long: even MD5 uses 32 hex characters (I also don't know how to square this with the fact that it produces a 128-bit value, since 16^32 seems much bigger than that).
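If it helps, here is a minimal Python sketch of what I mean (the URL is made up):

```python
import hashlib

url = "https://example.com/some/long/path"  # hypothetical link
slug = hashlib.md5(url.encode()).hexdigest()
print(slug, len(slug))  # always 32 hex characters for MD5's 128-bit digest
```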
Suppose I took a collision-resistant hash function like SHA-512 and its 128-character hex output:
ddaf35a193617abacc417349ae20413112e6fa4e89a97ea20a9eeee64b55d39a2192992a274fc1a836ba3c23a3feebbd454d4423643ce80e2a9ac94fa54ca49f
and truncated it to just eight characters:

ddaf35a1
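Concretely (that digest happens to be SHA-512 of the string "abc", which I'm using as a stand-in for a URL):

```python
import hashlib

full = hashlib.sha512(b"abc").hexdigest()  # the 128-character digest
slug = full[:8]                            # keep only the first 8 hex characters
print(slug)  # ddaf35a1
```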
I have two (probably naive) questions:
- Why does the 512-bit digest take 128 hex characters and not log_{16}(512)? Or, the other way round, can I cram 16^8 bits of entropy into eight hex characters (or some other short alphanumeric string)?
- Assuming the answer to 1. is something obvious I'm missing, does truncating a 128-character hash to 8 characters behave "like an 8-character hash function"? In other words, beyond accounting for the reduced size of the hash space, are collisions more likely than you would expect from a hash function with a digest of that length?
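For scale, here is the birthday-bound arithmetic for my numbers, under the (possibly wrong, hence this question) assumption that the truncated output behaves like a uniform random function over 16^8 slugs:

```python
import math

n = 1_000_000        # anticipated number of links
space = 16 ** 8      # 8 hex characters give 16^8 = 2^32 possible slugs

# Expected number of colliding pairs, and the probability of at
# least one collision, from the standard birthday approximation.
expected_pairs = n * (n - 1) / (2 * space)
p_any = 1 - math.exp(-expected_pairs)
print(expected_pairs, p_any)  # roughly 116 expected pairs, p ≈ 1
```

So even in the best case collisions look hard to avoid at this length, which is part of why I want to understand whether truncation makes things even worse.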