An RC5-64 Crack Engine

This document is a first effort - I'm putting it out for comment - it's all pretty tounge-in-cheek, but the design and numbers resulting from it are hopefully real - if you find any mistakes please drop me a note at paul@taniwha.com.

Like many people I've joined the distributed.net RC5-64 crack effort - a wonderfully (relatively) pointless effort - at the current crack rate my lowly Pentium is clocking along at 150,000 keys/sec - at this rate by myself I'll cover the entire 2**64 key key space in about 3 million years and with a million similar systems joining in we could do it in 3 years ..... barring blind-luck we're in it for a long haul ....

But I'm a chip designer ..... building FAST stuff is what I do for a living .... how could we speed up this process? The following is an evening's verilog hacking and some quick back of the envelope calculations - it's all doable (except perhaps the last section's musings on integration).

A paper design

There has already been some work done in looking at purpose build hardware solutions, including an FPGA design. I've stolen the algorithm from this design and hacked up a quick verilog implementation (to be put here later - it's very simple but I want to tidy it up a bit prior to releasing it). The basic key engine takes ~80 clocks to check a single key, and processes 2**32 keys before needing attention.

A quick run through synopsys with a 0.25 micron library gives a design that runs at 100MHz and takes ~17K gates - all this with out even trying hard - I'm sure that spending some time on it could bring the gate count down or the clock speed up (there's an interesting design space to work in here - more engines running slower vs. fewer running faster) - the design is pretty combinatorial all those adders and barrell shifters - and it's hard to pipe well - it looks like a tough nut to crack in hardware (I'm sure it's designed to be that way). Anyway all up, including scan it comes to ~17k gates - which at 0.25u is (very roughly) 0.7 sq mm.

In this design chips are build that contain clusters of key engines (basicly as many as you can economically fit on a die - I've been assuming 8 for much of this, although you might get 1 or 2 in an FPGA) - dies are connected into a tree structure with 13-bit buses connecting them together - 1 13-bit bus come down to each die from above and 2 go out to other dies below this one.

At the top level the first die is connected to a PC's parallel port - from there a controll program in the PC can distribute keys to engines and collect up the results. Each engine runs at 100MHz - and takes ~80 clocks to check a key - this gives a key rate of 1.25M keys/sec - since an individual engine processes 2**32 keys before it needs a new one from the host it runs by itself for about an hour - checking to see if it succeeded and loading a new key probably would take a millisecond or so (probably a lot less - depending on the depth of the tree of dies) - this means that a single PC could support roughly 3 million engines. Clustering the engines into groups of 8 per die we can run them together on larger ranges of keys (more than 2**32) and set them up together - meaning that a PC could support 8x the number of engines - ~24million (this limit is pretty arbitrary it's pretty easy to see how to raise this number).

If we have 8 engines/die this corresponds to 1M dies and a tree depth of ~20.

With 24 million crack engines each doing 1.25 million keys/sec we can walk the entire key space in 7 days - woo hoo! the prize is ours! .... well almost there are just a few details ....

I've put a summary and Verilog source for my design here

Why this doesn't work quite so well

So we want to build this thing that nabs RC5-64 in 7 days ... and we rush off, tape out our chip and place an order for 1 million dies - first of all we're going to get a volume discount .... suppose (I'm completely guessing here) we pay $1 each .... that's still $1M .... but we know we have a great chance of winning the $10k prize .....

Secondly how are we going to mount all this stuff - really big PC boards come to mind, suppose we put 100 chips on each board .... we're still going to need 10,000 boards it's a lot of space - and a wiring nightmare .... it's going in someone else's garage ....

At this point you realize that an 8-bit inter die bus is a problem because of all the copper outside the die - and drop it down to a serial bus - and waste some silicon to manage this instead.

How you might pull it off

One of the nice thing about this sort of problem is that not only is it highly parallel but the IPC required is pretty minimal - each engine needs to be talked to for 1mS/hour and the killer problem is not the silicon but the packaging.

The one solution that possibly makes sense - wafer-scale integration - forget about chopping up a wafer into die, packaging them into plastic and soldering them onto boards and putting the boards into a massive card cage - just do as much of your wiring as possible on a single die - the nice thing about this sort of a problem is that the engines are pretty self contained - routing around die defects would be really easy just lay out your engines in a grid - talk to one at a corner and work your way in, when you find one that's dead (maybe you mapped it with scan when you brought the wafer up) you just ignore it and route around it. In this case a grid based connection matrix makes more sense than a tree-based one.

So what can we get on a wafer? - a 7 inch wafer has an area of 24000 sq mm - assuming a single engine is 0.7sq mm and 90% of the wafer area is available and we get a 90% yield on engines - then we get ~27,000 engines/die. We only need 890 wafers and we've got our 24 million engines - it'll fit in a refrigerator sized cabinet (considering the power it's going to burn it had better be a refrigerator). 7 days later we've cracked the code.

Now before you run out and build yourself one .... remember large well funded companies have gone belly-up working on wafer-scale ....

On the other hand the NSA probably has rooms and rooms full of just such engines ... so before you start in with that 64-bit key .... remember who might be watching .... and use the 96-bit one instead ..... that way they'll need 3500 billion wafers to crack it .... (oh yeah I forgot maybe they know something we don't ....).