THERE ARE GOOD IDEAS HERE BUT IT NEEDS REWRITING
When adding (or subtracting) two numbers, the result at any bit position can depend on the result of all of the less significant bits. The baseline ripple-carry adder is therefore very slow as it can only complete when the carry information has propagated from the carry input, through the low order bits to the high order bits and the carry output. This is by far the slowest step in the whole CPU and therefore worthy of careful design.
There are many carry lookahead mechanisms (I like Jacob Abraham's lecture) but all involve a lot more logic which really goes against the stated goals of this project. I'm hoping that the design below will achieve the necessary speed without significantly increasing the complexity.
Carry lookahead mechanisms construct two signals at every bit position:
Now we can express the carry out as Ci+1 = Gi + Pi Ci
Carry lookahead works by expanding this: Ci+2 = Gi+1 + Pi+1 Ci+1 = Gi+1 + Pi+1 Gi + Pi+1 Pi Ci
Or in general: Cn+1 = Gn + Pn Gn-1 + Pn Pn-1 Gn-2 + ... + Pn Pn-1 ... P2 G1 + Pn Pn-1 ... P1 C1
The carry out for a block of n inputs needs n+1 diodes (and a transistor) to compute Pn ... P1 C1 then n to compute Pn ... P2 G1 all the way down to 2 to compute Pn Gn-1 that's n(n+3)/2, plus another n+1 to get the Cn+1 making a grand total of (n2 + 5n)/2 +1. Now we don't just need to do this for the last carry, but all intermediate values as well, so it's O(n3) - yuck.
So let's be a bit smarter about this. The ALU has a 8T delay anyway and there some time to compute carries without slowing it down. So let's assume that we can build some logarithmic like structure that computes the carries in exploitable ALU latency and a bit more. This makes it a O(n2 log(n)) problem which is much better. In fact, if all the tree structure fitted into the exploitable ALU latency, which it almost can, we would be O(n2). Now consider where all the hard work is done, it's in computing Pn Pn-1 ... so let:
which can be computed recursively as:
It's also possible to construct a pair of updates to compute odd and even values of Qni of the form:
So the O(n2) problem is now O(n), much better.
Now we just need to build Qni - well the problem with transistor circuits is that transistors take a while to switch off (they need charge in order to work, it takes a while to drain the charge). Diodes work much faster, especially Schottky diodes which both switch fast and have low voltage drop. If we invert Pi then we have a large OR problem which can be implemented in hardware as a chain of diodes with each diode junction being linked to a NOT Pi. It works, if all NOT Pi are low then the output is low, if any one of the NOT Pi are high then the output is high(ish).
I see 0.83V drop across 5 diodes, so only 2.66V drop if I wanted the full 16 bit lookahead carry there's not quite enough signal, but with a hierarchical structure I don't want to a 16 bit lookahead in one go, so it's fine or I can use the odd/even structure above. Ten diodes will change state at 500kHz, but that's not great, I've had 900 khz from 4 NOR gates.
So - do Schottky diodes switch faster than BJTs? Or should I build a Schottky transistor and use ripple carry?
BELOW DEPENDS ON ABOVE:
So here's the not-yet-finished proposal:
The ALU has a 8T delay without the carry issue, but Pi can be calculated after 4T (check). So the whole ALU with the above will run in 12T, only 50% more than a "magic" solution (c.f. 8T + 16 * 2T = 40T for the ripple carry). Maybe the above is too good.