THERE ARE GOOD IDEAS HERE BUT IT NEEDS REWRITING There are many carry lookahead mechanisms (I like Jacob Abraham's lecture) but all involve a lot more logic which really goes against the stated goals of this project. I'm hoping that the design below will achieve the necessary speed without significantly increasing the complexity.
Carry lookahead mechanisms construct two signals at every bit position:
- G
_{i}= A_{i }B_{i}is the Generate signal, if A_{i}and B_{i}are set then we generate a carry at this position - P
_{i}= A_{i}+ B_{i}is the Propagate signal, if either A_{i}or B_{i}are set then we will propagate the carry from C_{i}to C_{i+1}
Now we can express the carry out as C
_{i+1} = G_{i} + P_{i} C_{i} Carry lookahead works by expanding this: C _{i+2} = G_{i+1} + P_{i+1} C_{i+1 = }G_{i+1} + P_{i+1 }G_{i} + P_{i+1 }P_{i} C_{i}Or in general: C _{n+1} = G_{n} + P_{n }G_{n-1} + P_{n }P_{n-1 }G_{n-2} + ... + P_{n }P_{n-1 }... P_{2 }G_{1}_{ }+ P_{n }P_{n-1 }... P_{1 }C_{1}_{}The carry out for a block of n inputs needs n+1 diodes (and a transistor) to compute P
_{n} ... P_{1} C_{1 }then n to compute P_{n} ... P_{2} G_{1} all the way down to 2 to compute P_{n} G_{n-}_{1} that's n(n+3)/2, plus another n+1 to get the C_{n+1 }making a grand total of (n^{2} + 5n)/2 +1. Now we don't just need to do this for the last carry, but all intermediate values as well, so it's O(n^{3}) - yuck.So let's be a bit smarter about this. The ALU has a 8T delay anyway and there some time to compute carries without slowing it down. So let's assume that we can build some logarithmic like structure that computes the carries in exploitable ALU latency and a bit more. This makes it a O(n ^{2} log(n)) problem which is much better. In fact, if all the tree structure fitted into the exploitable ALU latency, which it almost can, we would be O(n^{2}). Now consider where all the hard work is done, it's in computing P_{n }P_{n-1 }... so let:
It's also possible to construct a pair of updates to compute odd and even values of Q _{ni }of the form:
^{2}) problem is now O(n), much better.Now we just need to build Q _{ni} - well the problem with transistor circuits is that transistors take a while to switch off (they need charge in order to work, it takes a while to drain the charge). Diodes work much faster, especially Schottky diodes which both switch fast and have low voltage drop. If we invert P_{i} then we have a large OR problem which can be implemented in hardware as a chain of diodes with each diode junction being linked to a NOT P_{i}. It works, if all NOT P_{i} are low then the output is low, if any one of the NOT P_{i} are high then the output is high(ish).I see 0.83V drop across 5 diodes, so only 2.66V drop if I wanted the full 16 bit lookahead carry there's not quite enough signal, but with a hierarchical structure I don't want to a 16 bit lookahead in one go, so it's fine or I can use the odd/even structure above. Ten diodes will change state at 500kHz, but that's not great, I've had 900 khz from 4 NOR gates. So - do Schottky diodes switch faster than BJTs? Or should I build a Schottky transistor and use ripple carry? BELOW DEPENDS ON ABOVE: So here's the not-yet-finished proposal:
- t=0 start with C0
- t=2 run a 5-block to get C5 and a 1 block to get C1
- t=4 run another 5-block and a 2-block from C5 to get C10 and C7, C1 goes to C2
- t=6 run a 4-block and a 2-block from C10 to get C14 and C12, C7 gives C8, C2 gives C3
- t=8 run a 2-block from C14 to get C16 and C15, C8 gives C9, C3 gives C4
The ALU has a 8T delay without the carry issue, but P
_{i} can be calculated after 4T (check). So the whole ALU with the above will run in 12T, only 50% more than a "magic" solution (c.f. 8T + 16 * 2T = 40T for the ripple carry). Maybe the above is too good. |