Most cryptographic hash functions operate on a fixed-size internal state and process the message to be hashed block by block, where a block is a number of bits. MD5, SHA-1 and SHA-2 follow the Merkle-Damgård construction which operates as follows:
- Let S be an array of N bits, initialized to a predefined value. N is the size of the internal state and may be different from the final size of the hash.
- Update the state based on each message block successively. More precisely, for each message block Mi, set S := C(Mi, S). The function C is known as the compression function; it takes a K-bit block and an N-bit state value and produces a new N-bit state value.
- For the last few bits of input, which may not form a complete block (if the length of the message is not a multiple of the block length), apply a padding function to S that takes the length of the message into account (the Merkle-Damgård construction requires a specific padding method which I am not describing here).
- Return F(S) where F is called the finalization function. F takes an N-bit state as input and returns an n-bit hash.
The compression function is so-called because it takes K + N bits of input and returns only N bits of output. This of course means that there are a lot of collisions: distinct (Mi,S) pairs producing the same resulting state. However, if done right, i.e. if it is computationally infeasable to find collisions for the compression function, then the hash itself is also collision-resistant.
There are non-Merkle-Damgård cryptographic hash functions, for example SHA-3, which uses a sponge construction. The operating method is a little different, but the core principle is the same:
- Work with a fixed-size state.
- Transform the state based on one message block at a time.
- Apply padding to the last block.
- Apply a finalization function to the internal state to return the hash value.
Non-cryptographic hashes often rely on arithmetic (but this approach doesn't tend to produce fast cryptographic hash functions). They use modular arithmetic as the compression function to keep the current state to a fixed size.