34

What is it about GPUs that lets them crack passwords so quickly?

It seems like the driving force behind adopting good key-derivation functions for passwords (bcrypt, PBKDF2, scrypt) instead of yesterday's cryptographic hashes (MD*, SHA*) is that the latter are vulnerable to programs that run on GPUs and guess huge numbers of passwords extremely quickly. Why should GPUs happen to be so much better at evaluating those hash functions than CPUs?

Nick

2 Answers

43

To complete @Terry's answer: a GPU has a lot of cores (hundreds). Each core is basically able to compute one 32-bit arithmetic operation per clock cycle, thanks to pipelining. Indeed, GPUs work well with extreme parallelism: when there are many identical work units to perform, many more than there are actual cores ("identical" meaning "same instructions", but not "same data").

Some details, for a somewhat old NVidia card (a GTX 9800+, from early 2009): there are 128 cores, split into 16 "multicore units". Each multicore can initiate 8 operations per cycle (hence the idea of 128 cores: that's 16 times 8). The multicore handles work units ("threads") by groups of 32, so that when a multicore has an instruction to run, it actually issues that instruction to its 8 cores over 4 clock cycles. This is operation initiation: each individual operation takes up to 22 clock cycles to run. You can imagine the instruction and its operands walking into the circuit as an advancing front line, like a wave in a pool: a given wave will take some time to reach the other end of the pool, but you can send several waves sequentially.

So you can maintain the rhythm of "128 32-bit operations per cycle" only as long as you have at least 22 times as many "threads" to run (i.e. a minimum of 22·128 = 2816), such that threads can be grouped by packs of 32 "identical" threads which execute the same instructions at the same time, like hip-hop dancers. In practice, there are some internal thresholds and constraints which require more threads to achieve the optimal bandwidth, up to about 4096.
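
To make "many identical work units" concrete, here is a minimal C sketch (not the author's code) of the kind of workload that maps well onto a GPU: the same short sequence of 32-bit operations applied independently to thousands of inputs. On a GPU, each loop iteration would become one thread, executed in lock-stepped groups of 32; the `mix()` function and its round count are illustrative placeholders, not real SHA-1.

```c
#include <stdint.h>
#include <stdio.h>

#define NUM_CANDIDATES 4096   /* enough independent work units to keep every pipeline busy */
#define NUM_ROUNDS     80     /* placeholder; SHA-1 happens to have 80 rounds */

/* Toy stand-in for a hash round: the same 32-bit instruction sequence for every input. */
static uint32_t mix(uint32_t x)
{
    for (int i = 0; i < NUM_ROUNDS; i++) {
        x = (x << 5) | (x >> 27);   /* rotate-left by 5, as in SHA-1's round function */
        x ^= 0x5A827999u;           /* a SHA-1 round constant, used here as filler */
        x += (uint32_t)i;
    }
    return x;
}

int main(void)
{
    uint32_t out[NUM_CANDIDATES];

    /* Every iteration runs the exact same instructions on different data;
     * a GPU would assign one thread per candidate password. */
    for (uint32_t c = 0; c < NUM_CANDIDATES; c++)
        out[c] = mix(c);

    printf("candidate 0 -> %08x\n", (unsigned)out[0]);
    return 0;
}
```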

I could achieve close to 99% of the optimal bandwidth with a SHA-1 implementation. SHA-1 uses a bit more than 1100 32-bit operations (that would be around 900 on a CPU, but a GTX 9800+ has no rotation opcode, so rotations must be split into two shifts and a logical or), and the GPU ran at 1450 MHz, for a grand total of about 160 million SHA-1 computations per second. This can be achieved only as long as you have millions of SHA-1 instances to compute in parallel, as is the case for password cracking (at any time, you need 4096 parallel SHA-1 to feed the GPU cores, but you also have to deal with I/O costs for input of potential passwords, and these costs will dominate if you do not have a lot of SHA-1 instances to process).
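
As a back-of-the-envelope check of those figures (a sketch reusing only the numbers quoted above): without a rotate opcode, each 32-bit rotation costs two shifts and an OR, and the peak rate is roughly 128 operations per cycle, times 1450 MHz, divided by about 1100 operations per hash.

```c
#include <stdint.h>
#include <stdio.h>

/* Without a native rotate instruction, a 32-bit rotate-left costs
 * two shifts and an OR: three operations instead of one (valid for 0 < n < 32). */
static uint32_t rotl32(uint32_t x, unsigned n)
{
    return (x << n) | (x >> (32 - n));
}

int main(void)
{
    double ops_per_cycle = 128.0;   /* 16 multicore units x 8 cores */
    double clock_hz      = 1450e6;  /* shader clock of the card described above */
    double ops_per_sha1  = 1100.0;  /* ~1100 32-bit operations per SHA-1 on this GPU */

    /* Roughly 169 million SHA-1/s at the theoretical peak, consistent with
     * the ~160 million per second reported above. */
    printf("peak SHA-1/s ~ %.0f\n", ops_per_cycle * clock_hz / ops_per_sha1);
    printf("rotl32(1, 5) = %u\n", (unsigned)rotl32(1, 5));
    return 0;
}
```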

The host PC, on its CPU (a quad-core 2.4 GHz Intel Core2), could achieve about 48 million SHA-1 per second, and that was with thoroughly optimized SSE2 code. A single SHA-1 will use about 500 clock cycles on such a CPU (the CPU can compute several instructions in a single cycle, provided they don't compete for resources and don't depend on each other), but, for password cracking, it is worthwhile to use SSE2, whose 128-bit registers can apply the same 32-bit operation to four data values in parallel. With SSE2 constraints, it takes about 800 clock cycles to run four parallel SHA-1, so that's 200 clock cycles per SHA-1 instance. There are four cores in that CPU and the whole thing runs at 2400 MHz, hence 48 million per second.
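
To illustrate the 4-wide SSE2 trick (a sketch of the idea, not the author's cracker): one 128-bit register holds one 32-bit word from each of four independent SHA-1 instances, so a single instruction advances all four hashes at once.

```c
#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    /* One 32-bit lane per independent SHA-1 instance: four hashes in flight. */
    __m128i a = _mm_set_epi32(4, 3, 2, 1);
    __m128i b = _mm_set_epi32(40, 30, 20, 10);

    /* One SSE2 instruction applies the same 32-bit operation to all four
     * lanes, e.g. the additions and XORs of a SHA-1 round. */
    __m128i sum   = _mm_add_epi32(a, b);
    __m128i mixed = _mm_xor_si128(sum, _mm_slli_epi32(sum, 5));

    uint32_t out[4];
    _mm_storeu_si128((__m128i *)out, mixed);
    printf("%u %u %u %u\n", (unsigned)out[0], (unsigned)out[1],
                            (unsigned)out[2], (unsigned)out[3]);

    /* From the figures above: ~800 cycles for four parallel SHA-1 = 200 cycles
     * each; 4 cores x 2.4e9 Hz / 200 cycles ~= 48 million SHA-1 per second. */
    return 0;
}
```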

More recent hardware will be faster, but GPUs more so. A GTX 680 sports a whopping 1536 cores, and there are two such GPUs in a GTX 690. We are talking billions of SHA-1 instances per second here.

(For comparison, I also did an implementation of SHA-1 on the Cell processor, i.e. the CPU in a PS3 console, with its 8 "SPU" coprocessors. One SPU was not available. With the 7 others, I reached about 100 million SHA-1 per second, i.e. better than a contemporary big PC CPU, but not as good as a good GPU of the same era.)


Summary: GPUs achieve great performance by using heavy parallelism, with hundreds (if not thousands) of cores. This is made possible by pipelining (each individual operation takes many cycles to run, but successive operations can be launched like trucks on a highway) and sharing instruction decoding (since many cores will run the same instructions at the same time).

Thomas Pornin
    "Like hip-hop dancers" :) – Nick Mar 19 '13 at 15:32
  • Some context on the numbers: assuming a set of about 100 possible characters, there are 1 trillion possible 6-character passwords. At a billion passwords per second, this is a bit under 17 minutes to try every possible 6-character password. Less if we assume the character set is smaller (most people will never use a good chunk of the printable ASCII character set in passwords). Another reminder to salt your hashes. – Tom Marthenal Mar 23 '13 at 08:03
  • Salts do not protect against brute force attempts. *Slow hashes* do. – Stephen Touset May 24 '14 at 03:10
  • If a password algorithm were designed to maximize the usefulness of a GPU for validating a *single* password (e.g. given the password FNORD, compute iterated SHA-1 hashes for FNORD0000, FNORD0001, FNORD0002, etc. through FNORD9999 and then combine them via some method), would any kind of cracking hardware be able to provide more bang for the buck than would be available for a "good guy" system that needed to validate passwords one at a time? – supercat Aug 19 '15 at 23:14
  • Do GPUs actually perform a single 32-bit arithmetic operation in a single clock cycle (taking into account latency so e.g. a throughput of 4 and latency of 4 would still count as 1 retired instruction per cycle)? I know that for CPUs at least, different kinds of arithmetic might take several cycles, even dozens (in the case of division, for example). Are GPUs not like that? – forest Jun 30 '18 at 05:36
19

A GPU is excellent at performing mathematical calculations. Graphics rendering is simply a series of complex mathematical calculations, and so are hashing algorithms.

A GPU has hundreds of cores that can be used to compute mathematical functions in parallel. A CPU usually has 4-8 cores. Although a CPU core is much faster than a GPU core, password hashing is one of the functions that can be parallelized very easily. This is what gives GPUs a massive edge in cracking passwords.

You should note that of the three algorithms you mentioned, PBKDF2 can still be cracked relatively easily on a GPU. The PBKDF2 algorithm, in very basic terms, hashes a password with a hash function like MD5 or SHA1 thousands of times. While much stronger than a simple MD5 or SHA1 hash, it can still be cracked relatively fast with a GPU.
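
For context, PBKDF2 is essentially a loop of HMAC computations, and its only real cost knob is the iteration count; each iteration is a cheap hash step of exactly the kind a GPU evaluates well. A minimal sketch using OpenSSL's implementation (the library choice and parameter values here are just for illustration):

```c
#include <openssl/evp.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    const char *password = "correct horse";
    const unsigned char salt[] = "per-user-salt";
    unsigned char key[20];

    /* PBKDF2-HMAC-SHA1: the work factor is simply "repeat the HMAC 10000 times".
     * Each iteration is a fast SHA-1-based step, so hardware that is good at
     * SHA-1 is good at PBKDF2 too, just 10000 times slower per guess. */
    if (PKCS5_PBKDF2_HMAC(password, (int)strlen(password),
                          salt, (int)(sizeof salt - 1),
                          10000,              /* iteration count, arbitrary here */
                          EVP_sha1(),
                          (int)sizeof key, key) != 1) {
        fprintf(stderr, "PBKDF2 failed\n");
        return 1;
    }

    for (size_t i = 0; i < sizeof key; i++)
        printf("%02x", key[i]);
    printf("\n");
    return 0;
}
```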

bcrypt and scrypt are designed to avoid the massive speedup in cracking time a GPU affords an attacker. See this excellent answer by Thomas Pornin for more information: https://security.stackexchange.com/a/31846/10211

  • Cool; I had no idea GPUs had so many cores! – Nick Mar 19 '13 at 07:55
  • A GPU tends not to have *thousands* of cores, but hundreds is accurate. – Polynomial Mar 19 '13 at 08:29
  • @Polynomial so 'stream processors' != 'GPU cores'? Because my card has almost 3,000 of those. – lynks Mar 19 '13 at 12:58
  • @lynks It's difficult to directly correlate between the idea of GPU internals and traditional cores, but you could describe each stream processor as a core. I'll admit I didn't realise modern GPUs had so many stream processors though. The main difference is that GPU cores tend to operate in blocks in order to perform a particular task, with certain special bus types joining them together, whereas CPU cores are practically independent. – Polynomial Mar 19 '13 at 20:26
  • They are defined as cores: http://cdn3.wccftech.com/wp-content/uploads/2013/02/er_photo_184846_52.png From a pure processing power perspective, the Titan is 14.2 times faster than the current fastest CPU of equivalent cost. – k1DBLITZ Mar 26 '13 at 20:11