
I understand that a rainbow table solves the storage problem when attacking a password using precomputed hashes. However, since rainbow tables are essentially a compressed version of the hashes, don't they have to be decompressed before being compared with the target hashes? Isn't this decompression process expensive? How does the time spent decompressing a rainbow table compare to that spent computing the hashes from scratch?

Minaj
  • What really confuses me about hash functions is why we can't use big data techniques to predict, or even figure out, the logic used to calculate the hash in the first place – Mathematics Aug 18 '16 at 10:22
  • @Mathematics You don't need to. Hashes are perfectly predictable: we know exactly how they work, and given a string you will know exactly what its hash will be with any given algorithm. The point is that the reverse direction is impossible. – F.P Aug 18 '16 at 12:06

2 Answers


Your basic premise is wrong: a rainbow table is not just a compressed list of every possible hash lookup, and you do still need to do some hashing on the fly. Instead, it's a way of exploiting the nature of hashes to avoid storing the lookups in the first place, and to minimise the amount you need to re-compute.

Wikipedia has quite a detailed explanation and there is an existing question here with good answers, but the basic idea is you create a table like this:

  1. Start with a particular password guess
  2. Hash it
  3. Take the hashed value as the next password guess (after applying some reproducible transform)
  4. Hash that
  5. Repeat steps 3 and 4 a set number of times
  6. Store only the original guess, and the final hash. It is this that makes the rainbow table smaller than a full lookup table.
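To make that concrete, here is a minimal sketch in Python (the choice of MD5, the `reduce_to_guess` transform, and the 8-character lowercase password space are all illustrative assumptions, not part of any standard table format):

```python
import hashlib

CHAIN_LENGTH = 1000  # illustrative; real tables tune this against storage

def md5_hex(s: str) -> str:
    return hashlib.md5(s.encode()).hexdigest()

def reduce_to_guess(hash_hex: str, step: int) -> str:
    """The 'reproducible transform' of step 3 (hypothetical): map a hash
    back into the password space, here 8 lowercase letters. Mixing in the
    step index helps keep different chains from merging."""
    n = (int(hash_hex, 16) + step) % (26 ** 8)
    chars = []
    for _ in range(8):
        n, r = divmod(n, 26)
        chars.append(chr(ord("a") + r))
    return "".join(chars)

def build_chain(start_guess: str) -> tuple[str, str]:
    """Steps 1-5: alternate hashing and reducing. Step 6: keep only the
    endpoints; every intermediate guess can be regenerated on demand."""
    guess = start_guess
    for step in range(CHAIN_LENGTH):
        guess = reduce_to_guess(md5_hex(guess), step)
    return start_guess, md5_hex(guess)

# The stored table is just {final_hash: starting_guess}, one pair per chain.
starts = ["aaaaaaaa", "qwertyui", "letmein1"]
table = {end: start for start, end in map(build_chain, starts)}
```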

From the values you've stored, you can get back all the guesses generated at step 3. In a sense, each pair of { first guess, last hash } "compresses" the full chain of generated guesses.

But the trick is that you don't need to try all the chains to reverse a hash, because if you take the hash you're attacking and carry on from step 3 (treating it as the output of step 2), you will eventually end up at one of the final hashes you stored at step 6. Once you've found that, you can recreate ("decompress", if you like) just that one chain, by going back to step 1 (the stored password guess) and generating all the intermediate guesses.
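A sketch of that lookup process, reusing the helpers from the sketch above (again illustrative; a real cracker handles chain merges and false alarms more carefully):

```python
def crack(target_hash: str, table: dict[str, str]) -> str | None:
    """Reverse target_hash if some stored chain covers it."""
    # The target could sit at any position in a chain, so for each assumed
    # position, walk the hash forward and see if we land on a stored endpoint.
    for offset in range(CHAIN_LENGTH, -1, -1):
        h = target_hash
        for step in range(offset, CHAIN_LENGTH):
            h = md5_hex(reduce_to_guess(h, step))
        start = table.get(h)
        if start is None:
            continue  # no endpoint hit for this assumed position
        # Endpoint hit: replay ("decompress") just this one chain from its
        # stored start until we find a guess that hashes to the target.
        # (A false alarm caused by a chain merge simply finds nothing.)
        guess = start
        for step in range(CHAIN_LENGTH + 1):
            if md5_hex(guess) == target_hash:
                return guess
            guess = reduce_to_guess(md5_hex(guess), step)
    return None

# e.g. crack(md5_hex("qwertyui"), table) returns "qwertyui"
```

Note that nothing is ever decompressed wholesale: a failed endpoint probe only costs hashing time, and a successful one costs at most one chain's worth of regeneration.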

An important difference between this and compression is that you can make the stored table as small as you want, by making the chains longer - you'll just have to spend longer re-generating hashes to choose a chain, and to re-create the chosen chain. You could have a million chains of length ten, or ten of length one million, trading storage against CPU time.
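To put rough numbers on that trade-off (illustrative, and ignoring chain merges): covering 10^10 candidate passwords takes about 10^9 stored pairs with chains of length ten, but only about 10^4 pairs with chains of length one million. The price appears at lookup time: with chains of length L, a single reversal can cost on the order of L^2 hash computations, because there are up to L positions to try and each try needs up to L hashes to walk to an endpoint.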

It is of course possible to compress the resulting data using any algorithm you like. Some of these will require decompressing the entire table before searching it, others might arrange the data so that it is still searchable but takes up less space. But it would also be possible to store the entire rainbow table as, say, a sorted list on a fast SSD, and you would still have saved space over a full hash table because you are only storing the start and end of each chain, not every possible hash.

IMSoP
  • "It is of course possible to compress the resulting data using any algorithm you like." This seems unlikely. The entries in a rainbow table are usually random, and if the hash is cryptographically secure then the result after hash chaining should definitely appear random to a compression algorithm. – Steve Cox Aug 17 '16 at 19:41
  • "But it would also be possible to store the entire rainbow table as, say, a sorted list on a fast SSD ... you are only storing the start and end of each chain, not every possible hash." You actually should only be storing the end of each chain on your fastest media (ssd or ram). The beginning of each chain is comparatively accessed exceedingly rarely (~once per complete chain traversal) and can sit on cheaper bulk media. – Steve Cox Aug 17 '16 at 20:17
  • The biggest challenge in making a rainbow table with chains of length one million would be that a rainbow table should use a unique transform to turn the hash into a possible password at each step in the chain. This is required to prevent collisions, so you don't end up with two million-length chains covering almost all the same passwords. Making a million of these transform functions would seem like a pretty big challenge. – Evan Steinbrenner Aug 17 '16 at 21:24
  • @SteveCox Sure, all of those things would be important when choosing how to compress. But you *could* compress it any way you like - it just might not work very well. As for the random nature of hashes, a sorted list of sufficient size will have redundancy, as exploited by the compression scheme suggested in polynomial's answer. – IMSoP Aug 18 '16 at 08:43
  • @Evan, I suppose one would do that by creating one transform function that can be modified with a parameter, and then using something like a simple counter for the parameter to make the required difference. – ilkkachu Aug 18 '16 at 09:26

No. The compression doesn't work like traditional RLE- or LZMA-style compression.

Rainbow tables are, essentially, a lookup table which allows you to find a string given its hash. They're designed to be incredibly efficient at finding a hash in the index across billions of entries, while minimising disk space.

Now, imagine you're building a table for lots and lots of strings. The hashes of some of these strings start with the same byte - for example, "StackExchange", "ILikeWaffles9", "ILikeWaffles13507", and "Decompression242" when hashed with MD5 all start with 0xF2. Instead of storing all four hashes in full, you can construct a tree-like structure so that the data looks like this:

  • f2
    • 173dcd3c1a83febadc8ed1759c3ffc = "ILikeWaffles13507"
    • 17f4a64e4036025c07b24a96ec787a = "Decompression242"
    • 50514201b94be52c1ea16cd688384e = "ILikeWaffles9"
    • 5cb1c6953bb0c62c639f3d7a242ec4 = "StackExchange"

Note that the hashes are sorted in numeric order.

In fact, since the first two hashes also share the same second byte (0x17), these can be grouped a level deeper:

  • f2
    • 17
      • 3dcd3c1a83febadc8ed1759c3ffc = "ILikeWaffles13507"
      • f4a64e4036025c07b24a96ec787a = "Decompression242"
    • 50514201b94be52c1ea16cd688384e = "ILikeWaffles9"
    • 5cb1c6953bb0c62c639f3d7a242ec4 = "StackExchange"

This also allows you to perform a lookup incredibly quickly - instead of having to search the full table, you only have to traverse the tree, and then search through a smaller list of hashes. Since the hashes are sorted, you can perform a binary search, which also has very good performance.

As an example, if I have the hash f217f4a64e4036025c07b24a96ec787a, I look for the first tree node f2, then look to see if there's a sub-node for the second byte, 17. There is, so I continue down. I check to see if there's a sub-node for f4. There isn't, so I now search through the list within the f2 -> 17 node. I know that f4 is likely to be near the end of that list, so I start there. I find that the hash matches the one I'm searching for, so I now know that the plaintext is "Decompression242".

It is also incredibly space-efficient when you've got millions or billions of hashes, because you don't duplicate parts of the hash that are shared with other plaintexts.
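Here is a toy sketch of that prefix-sharing idea in Python. To be clear, it illustrates the compression scheme described in this answer, not how rainbow tables themselves work; the fixed two-level split and the in-memory dicts are simplifying assumptions:

```python
import hashlib

def md5_hex(s: str) -> str:
    return hashlib.md5(s.encode()).hexdigest()

class Node:
    def __init__(self):
        self.children = {}  # one child per shared leading byte, e.g. "f2"
        self.entries = {}   # remaining hash suffix -> plaintext

def insert(root: Node, hash_hex: str, plaintext: str, levels: int = 2) -> None:
    # Walk down one node per leading byte; a prefix shared by many hashes
    # is stored once in the tree rather than once per entry.
    node = root
    for i in range(levels):
        node = node.children.setdefault(hash_hex[2 * i:2 * i + 2], Node())
    node.entries[hash_hex[2 * levels:]] = plaintext

def lookup(root: Node, hash_hex: str, levels: int = 2) -> str | None:
    node = root
    for i in range(levels):
        node = node.children.get(hash_hex[2 * i:2 * i + 2])
        if node is None:
            return None
    # A real on-disk table would keep these sorted, enabling binary search.
    return node.entries.get(hash_hex[2 * levels:])

root = Node()
for word in ["StackExchange", "ILikeWaffles9", "ILikeWaffles13507", "Decompression242"]:
    insert(root, md5_hex(word), word)

print(lookup(root, md5_hex("Decompression242")))  # -> Decompression242
```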


EDIT: Sorry, I should have pointed out that this is not literally how rainbow tables work. This is just an example of how compression can work in this regard, without needing to actually save a full hash for each plaintext. I didn't mean to imply otherwise. IMSoP's answer better describes the actual workings.


The key thing to remember is that rainbow tables are only useful when you want to do multiple hash searches for that hash type. You generate a rainbow table for a particular string list or character list ahead of time, only once, and then you can use that generated data set as many times as you like. It's a trade-off of doing a lot of work ahead of time, so that your later searches are very fast.

Another key thing is that any hash scheme which includes a salt automatically renders rainbow tables useless, since each password and salt combination should (ideally) be unique, and the salt should be long enough that building a rainbow table for every possible password and salt combination is impractical.
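For illustration, a minimal salted-hashing sketch using Python's standard hashlib (the PBKDF2 parameters are arbitrary choices for the example):

```python
import hashlib
import hmac
import os

def hash_password(password: str) -> tuple[bytes, bytes]:
    # A fresh random salt per password means a table precomputed against
    # one salt is useless against every other salt.
    salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)
    return salt, digest

def verify(password: str, salt: bytes, digest: bytes) -> bool:
    candidate = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)
    return hmac.compare_digest(candidate, digest)  # constant-time comparison
```

With a random 16-byte salt per password, an attacker would need a separate rainbow table per salt value, which destroys the precompute-once, reuse-many-times economics.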

Polynomial
  • [Rainbow tables](https://en.wikipedia.org/wiki/Rainbow_table) are about a lot more than just compressing the hashes using a tree structure! – IMSoP Aug 17 '16 at 15:17
  • @IMSoP My example is a simplification of the actual structure. There are additional optimisations in rainbow tables, but they aren't particularly relevant to the question of whether they need to be "decompressed". – Polynomial Aug 17 '16 at 15:29
  • You are missing the most fundamental aspect of what a rainbow table is, which is a way of exploiting the nature of hashes to store partial data and recompute part of it on demand. The tree-based storage you've described here could be done for any list of hash lookups, but that wouldn't make it a rainbow table. – IMSoP Aug 17 '16 at 15:43
  • I'm not sure why this is getting up votes. **The described compression scheme has nothing to do with rainbow tables**. It's certainly possible that the result of the rainbow table is stored this way, but it's also possible that it's LZMA-compressed and loaded into memory as a single sorted list, or any number of other compression and searching algorithms. – IMSoP Aug 17 '16 at 16:16
  • The space savings of this scheme are not big, because the number of hashes required to save one more bit rises exponentially. – usr Aug 17 '16 at 17:34
  • @IMSoP The question isn't about what a rainbow table is - it's about the compression of a rainbow table. Dumping key-value pairs into a sorted list and using a search algorithm has nothing to do with compression. Polynomial explains how you can implicitly compress the information such that your structure is almost all passwords and very little hash. – corsiKa Aug 17 '16 at 17:45
  • @corsiKa The premise of the question is wrong - rainbow tables are not "essentially a compressed version of the hashes" - so it cannot be meaningfully answered without correcting that misunderstanding. This answer explains one possible way that you *could* store a list of strings, but the statement "rainbow tables construct a tree-like structure" is simply false. I absolutely agree that sorting a list and searching it has nothing to do with compression; just as compression has nothing to do with rainbow tables. The algorithm presented here is interesting, but irrelevant. – IMSoP Aug 17 '16 at 17:57
  • @IMSoP You are correct, and I've added an edit. My intent wasn't to mislead, but rather to offer a simplistic example of how compression can be used without increasing computational cost, and in the process produce an efficient index. It seems in doing so I oversimplified and did more harm than good. – Polynomial Aug 17 '16 at 19:43
  • Thanks for choosing to explain it in simple terms - I get the picture. – Minaj Aug 18 '16 at 00:23
  • @Polynomial I really liked the way you explained the storage, but it doesn't clear up my doubts about storing hashes: if we can have a tree structure of hashes, why is it hard to predict them? – Mathematics Aug 18 '16 at 10:27
  • @Mathematics The two things are separate. The tree is produced based on the binary representation of the hash. The hash itself could be anything; it doesn't even need to be a hash, for that matter. The point is that it's an efficient storage and search method when you've got a large enough dataset for the first few bytes (up to 5 or so) to be guaranteed to repeat. Hashes are unpredictable without calculating them, but with enough of them the first bytes will always repeat. The same idea would work for searching any kind of string collection with that property. – Polynomial Aug 18 '16 at 11:23
  • @Polynomial If hashes are so random, then where does the guarantee come from that they will have similar first bytes? :S And even after calculating them they become predictable (ignoring salts for now), don't they? I'm asking a question rather than challenging. – Mathematics Aug 18 '16 at 11:51
  • @Mathematics Statistics, particularly the pigeonhole principle and the birthday paradox. If every possible value for each byte of the hash is equally probable, then if you generate just 512 hashes the chances are extremely high that some of them will share equal bytes. You can't predict what the hash will output for a specific string, but you can model the probability of colliding bytes based on a random oracle model. – Polynomial Aug 18 '16 at 13:35
  • @Polynomial "The two things are separate. ... The hash itself could be anything; it doesn't even need to be a hash for that matter." This is exactly why I don't think this is a good way of answering this question. The key insight is not that there are clever types of compression that would work well for rainbow tables, but that rainbow tables are small even if you don't compress them. – IMSoP Aug 18 '16 at 15:28
  • @IMSoP Possibly, as I said my intent was just to explain *how* that could be done in a simple way, which seemed to be what the person asking the question was most interested in. Rainbow Tables aren't really that simple so I chose an alternative way to explain it. – Polynomial Aug 18 '16 at 15:40
  • @Polynomial I guess we've interpreted the question differently. I've answered "why don't rainbow table-based crackers spend all their time decompressing data?" (answer: "it doesn't need to be compressed in the first place, it saves space in other ways"); you've answered "how would you compress a lookup table so that it wouldn't need decompressing before use?" (answer: "you could use a tree-based index"). I've taken "rainbow tables" to be the important aspect; you've taken "compression" to be the important aspect. – IMSoP Aug 18 '16 at 15:45
  • Even if this were how rainbow tables worked, this may not be the most efficient way to search them. The pseudo-random nature of hashes makes them amenable to an interpolation search, which is asymptotically faster than a binary search (`log log n` vs `log n`). – James_pic Aug 22 '16 at 09:48