Carry Propagation


When adding (or subtracting) two numbers, the result at any bit position can depend on all of the less significant bits.   The baseline ripple-carry adder is therefore very slow, as it can only complete once the carry information has propagated from the carry input, through the low-order bits, to the high-order bits and the carry output.   This is by far the slowest step in the whole CPU and therefore worthy of careful design.

There are many carry lookahead mechanisms (I like Jacob Abraham's lecture) but they all involve a lot more logic, which really goes against the stated goals of this project.   I'm hoping that the design below will achieve the necessary speed without significantly increasing the complexity.

Carry lookahead mechanisms construct two signals at every bit position:
  • Gi = Ai Bi is the Generate signal: if Ai and Bi are both set then we generate a carry at this position
  • Pi = Ai + Bi is the Propagate signal: if either Ai or Bi is set then we will propagate the carry from Ci to Ci+1
Now we can express the carry out as Ci+1 = Gi + Pi Ci
Carry lookahead works by expanding this:  Ci+2 = Gi+1 + Pi+1 Ci+1 = Gi+1 + Pi+1 Gi + Pi+1 Pi Ci
Or in general: Cn+1 = Gn + Pn Gn-1 + Pn Pn-1 Gn-2 + ... + Pn Pn-1 ... P2 G1 + Pn Pn-1 ... P1 C1
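As a quick software check (mine, not part of the hardware), the expanded expression can be evaluated directly and compared against an ordinary bit-by-bit ripple add — bit 0 of each list is the least significant bit, and the carry-in plays the role of C1:

```python
def ripple_carries(a_bits, b_bits, c_in=0):
    """Carry into each position, computed bit by bit (the slow way)."""
    carries = [c_in]
    for a, b in zip(a_bits, b_bits):
        g = a & b                              # Generate
        p = a | b                              # Propagate
        carries.append(g | (p & carries[-1]))  # Ci+1 = Gi + Pi Ci
    return carries

def lookahead_carry_out(a_bits, b_bits, c_in=0):
    """Cn+1 = Gn + Pn Gn-1 + ... + Pn ... P2 G1 + Pn ... P1 C1, evaluated directly."""
    g = [a & b for a, b in zip(a_bits, b_bits)]
    p = [a | b for a, b in zip(a_bits, b_bits)]
    n = len(g)
    result = 0
    for i in range(n):                 # the Pn ... Pi+1 Gi terms
        term = g[i]
        for pj in p[i + 1:]:
            term &= pj
        result |= term
    tail = c_in                        # the Pn ... P1 C1 term
    for pj in p:
        tail &= pj
    return result | tail
```

Exhaustively checking all 4-bit operands shows the two agree, and both match the carry out of a plain integer addition.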

The carry out for a block of n inputs needs n+1 diodes (and a transistor) to compute Pn ... P1 C1, then n to compute Pn ... P2 G1, and so on down to 2 to compute Pn Gn-1 — that's n(n+3)/2 — plus another n+1 to OR everything into Cn+1, making a grand total of (n² + 5n)/2 + 1.    Now we don't just need to do this for the last carry, but for all intermediate values as well, so it's O(n³) - yuck.
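A quick tally (my own arithmetic, just following the counting above) confirms that the per-term diode counts n+1, n, ..., 2 plus the final (n+1)-way OR really do sum to (n² + 5n)/2 + 1:

```python
def diode_count(n):
    """Diodes for one n-input carry-out block, counted as in the text."""
    and_terms = sum(range(2, n + 2))   # 2 + 3 + ... + (n+1) = n(n+3)/2
    return and_terms + (n + 1)         # plus the (n+1)-input OR for Cn+1
```

For a full 16-bit block that is 169 diodes for the final carry alone, which is why doing it in one go is unattractive.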

So let's be a bit smarter about this.   The ALU has an 8T delay anyway, so there is some time to compute carries without slowing it down.   So let's assume that we can build some logarithmic-like structure that computes the carries in the exploitable ALU latency plus a bit more.   This makes it an O(n² log(n)) problem, which is much better.   In fact, if the whole tree structure fitted into the exploitable ALU latency, which it almost can, we would be at O(n²).   Now consider where all the hard work is done: it's in computing Pn Pn-1 ..., so let:

Qn,i = Pn Pn-1 ... Pi

which can be computed recursively as:

Qn,i = Qn,i+1 Pi

It's also possible to construct a pair of updates to compute odd and even values of Qn,i of the form:

Sn,i = Sn,i+2 Pi+1 Pi

So the O(n²) problem is now O(n), much better.
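The recursion is worth seeing concretely — a sketch (mine, using the Qn,i = Pn Pn-1 ... Pi definition above): each Qn,i is one AND away from Qn,i+1, so all n products cost n-1 ANDs instead of rebuilding every product from scratch.

```python
def q_chain(p):
    """p[0], ..., p[n-1] hold P1 ... Pn; returns q where q[i-1] = Qn,i."""
    n = len(p)
    q = [0] * n
    q[n - 1] = p[n - 1]                # Qn,n = Pn
    for i in range(n - 2, -1, -1):
        q[i] = q[i + 1] & p[i]         # Qn,i = Qn,i+1 Pi
    return q
```

Each entry matches the direct product of the remaining P terms.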

Now we just need to build Qn,i.   The problem with transistor circuits is that transistors take a while to switch off: they need stored charge in order to conduct, and it takes a while to drain that charge away.   Diodes work much faster, especially Schottky diodes, which both switch fast and have a low voltage drop.   If we invert Pi then we have a large OR problem, which can be implemented in hardware as a chain of diodes with each diode junction linked to a NOT Pi.   It works: if all the NOT Pi are low then the output is low; if any one of the NOT Pi is high then the output is high(ish).

I see a 0.83V drop across 5 diodes, so the full 16-bit lookahead carry would drop about 2.66V, which doesn't leave quite enough signal.   But with a hierarchical structure I don't want to do a 16-bit lookahead in one go, so it's fine, or I can use the odd/even structure above.   Ten diodes will change state at 500 kHz, which is not great - I've had 900 kHz from 4 NOR gates.

So - do Schottky diodes switch faster than BJTs?  Or should I build a Schottky transistor and use ripple carry?


So here's the not-yet-finished proposal:
  • t=0 start with C0
  • t=2 run a 5-block to get C5 and a 1 block to get C1
  • t=4 run another 5-block and a 2-block from C5 to get C10 and C7, C1 goes to C2
  • t=6 run a 4-block and a 2-block from C10 to get C14 and C12, C7 gives C8, C2 gives C3
  • t=8 run a 2-block from C14 to get C16 and C15, C8 gives C9, C3 gives C4
The ALU has an 8T delay without the carry issue, but Pi can be calculated after 4T (check).   So the whole ALU with the above will run in 12T, only 50% more than a "magic" solution (c.f. 8T + 16 × 2T = 40T for the ripple carry).  Maybe the above is too good.
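My reading of the schedule above as a timetable, just to check that every carry is ready by t=8.  The proposal is unfinished: C6, C11 and C13 don't appear in the listed steps, so I've assumed (my addition, not from the bullets) that they ripple on from C5, C10 and C12 one step later, in the same way C1 goes to C2.

```python
ready = {"C0": 0}                      # time at which each carry is available
steps = {                              # t -> {carry produced: carry it needs}
    2: {"C1": "C0", "C5": "C0"},
    4: {"C2": "C1", "C6": "C5", "C7": "C5", "C10": "C5"},
    6: {"C3": "C2", "C8": "C7", "C11": "C10", "C12": "C10", "C14": "C10"},
    8: {"C4": "C3", "C9": "C8", "C13": "C12", "C15": "C14", "C16": "C14"},
}
for t, produced in sorted(steps.items()):
    for carry, source in produced.items():
        assert ready[source] < t       # the input carry must already exist
        ready[carry] = t
assert all(f"C{i}" in ready for i in range(17))
assert max(ready.values()) == 8        # everything lands by t=8
```

With the ripple continuations filled in, all sixteen carries (plus the carry out) are available by t=8, consistent with the 12T figure above.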