Like many people I've joined the distributed.net RC5-64 crack effort - a wonderfully (relatively) pointless effort - at the current crack rate my lowly Pentium is clocking along at 150,000 keys/sec - at this rate by myself I'll cover the entire 2**64 key key space in about 3 million years and with a million similar systems joining in we could do it in 3 years ..... barring blind-luck we're in it for a long haul ....
But I'm a chip designer ..... building FAST stuff is what I do for a living .... how could we speed up this process? The following is an evening's verilog hacking and some quick back of the envelope calculations - it's all doable (except perhaps the last section's musings on integration).
A quick run through synopsys with a 0.25 micron library gives a design that runs at 100MHz and takes ~17K gates - all this with out even trying hard - I'm sure that spending some time on it could bring the gate count down or the clock speed up (there's an interesting design space to work in here - more engines running slower vs. fewer running faster) - the design is pretty combinatorial all those adders and barrell shifters - and it's hard to pipe well - it looks like a tough nut to crack in hardware (I'm sure it's designed to be that way). Anyway all up, including scan it comes to ~17k gates - which at 0.25u is (very roughly) 0.7 sq mm.
In this design chips are build that contain clusters of key engines (basicly as many as you can economically fit on a die - I've been assuming 8 for much of this, although you might get 1 or 2 in an FPGA) - dies are connected into a tree structure with 13-bit buses connecting them together - 1 13-bit bus come down to each die from above and 2 go out to other dies below this one.
At the top level the first die is connected to a PC's parallel port - from there a controll program in the PC can distribute keys to engines and collect up the results. Each engine runs at 100MHz - and takes ~80 clocks to check a key - this gives a key rate of 1.25M keys/sec - since an individual engine processes 2**32 keys before it needs a new one from the host it runs by itself for about an hour - checking to see if it succeeded and loading a new key probably would take a millisecond or so (probably a lot less - depending on the depth of the tree of dies) - this means that a single PC could support roughly 3 million engines. Clustering the engines into groups of 8 per die we can run them together on larger ranges of keys (more than 2**32) and set them up together - meaning that a PC could support 8x the number of engines - ~24million (this limit is pretty arbitrary it's pretty easy to see how to raise this number).
If we have 8 engines/die this corresponds to 1M dies and a tree depth of ~20.
With 24 million crack engines each doing 1.25 million keys/sec we can walk the entire key space in 7 days - woo hoo! the prize is ours! .... well almost there are just a few details ....
I've put a summary and Verilog source for my design here
Secondly how are we going to mount all this stuff - really big PC boards come to mind, suppose we put 100 chips on each board .... we're still going to need 10,000 boards it's a lot of space - and a wiring nightmare .... it's going in someone else's garage ....
At this point you realize that an 8-bit inter die bus is a problem because of all the copper outside the die - and drop it down to a serial bus - and waste some silicon to manage this instead.
The one solution that possibly makes sense - wafer-scale integration - forget about chopping up a wafer into die, packaging them into plastic and soldering them onto boards and putting the boards into a massive card cage - just do as much of your wiring as possible on a single die - the nice thing about this sort of a problem is that the engines are pretty self contained - routing around die defects would be really easy just lay out your engines in a grid - talk to one at a corner and work your way in, when you find one that's dead (maybe you mapped it with scan when you brought the wafer up) you just ignore it and route around it. In this case a grid based connection matrix makes more sense than a tree-based one.
So what can we get on a wafer? - a 7 inch wafer has an area of 24000 sq mm - assuming a single engine is 0.7sq mm and 90% of the wafer area is available and we get a 90% yield on engines - then we get ~27,000 engines/die. We only need 890 wafers and we've got our 24 million engines - it'll fit in a refrigerator sized cabinet (considering the power it's going to burn it had better be a refrigerator). 7 days later we've cracked the code.
Now before you run out and build yourself one .... remember large well funded companies have gone belly-up working on wafer-scale ....
On the other hand the NSA probably has rooms and rooms full of just such engines ... so before you start in with that 64-bit key .... remember who might be watching .... and use the 96-bit one instead ..... that way they'll need 3500 billion wafers to crack it .... (oh yeah I forgot maybe they know something we don't ....).