I like this introduction, "Computer architecture: The simplest processor", even though the nanoCPU is far simpler.

The processor consists of an instruction decoder, a set of registers, a memory unit, an ALU, and two buses, A and B. One of the registers serves as the program counter, P; it differs from the other registers only in that it's named in the I = *P++ instruction. This instruction is a hardwired operation that fetches the real instruction to be executed. That is, a single instruction in memory runs in two phases: the processor alternates between executing the instruction fetch instruction and executing the instruction fetched by the previous instruction fetch instruction:
  • FETCH:  The hardwired "I = *P++" is executed which fetches the next instruction into the instruction register, I, and increments the program counter, P.
  • EXECUTE:  The instruction in the instruction register, I, is run.
Both FETCH and EXECUTE run on the same hardware; it's a hack to avoid building the extra hardware that a normal CPU would use to load an instruction and increment the program counter.
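
As a rough illustration of the two-phase cycle, here is a minimal Python sketch. It is not the real hardware: storing instructions as Python callables, the dictionary layout and the names fetch, run, load_seven and add_one are all assumptions made for the example.

```python
# Sketch only: models the FETCH/EXECUTE alternation, not the real hardware.
# Assumption: instructions are stored in memory as Python callables.

def fetch(cpu):
    """The hardwired I = *P++: load I from memory at P, then increment P."""
    cpu["I"] = cpu["mem"][cpu["P"]]
    cpu["P"] += 1

def run(cpu, cycles):
    """Alternate FETCH (even cycles) and EXECUTE (odd cycles)."""
    trace = []
    for cycle in range(cycles):
        if cycle % 2 == 0:
            fetch(cpu)
            trace.append("FETCH")
        else:
            cpu["I"](cpu)  # run whatever the previous FETCH loaded
            trace.append("EXECUTE")
    return trace

# Two hypothetical instructions for the demonstration.
def load_seven(cpu):
    cpu["R0"] = 7

def add_one(cpu):
    cpu["R0"] += 1

cpu = {"P": 0, "I": None, "R0": 0, "mem": [load_seven, add_one]}
trace = run(cpu, 4)
print(trace, cpu["R0"], cpu["P"])
```

Two instructions in memory take four cycles here, and P ends up pointing just past the second one.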

There are two 3:8 demuxes, one for bus A and one for bus B. These select a register to wire to the bus. Their three bit inputs are called selectA and selectB.
With this, the registers can be considered as one block which may be read from or written to. There are two other components that can be read from or written to, the ALU and the RPi. The naming convention is that a line names the component being read from or written to (not the component doing the reading or writing over the bus). The status lines that enable reading and writing are:
  • readA, readB to specify that the selectA/B demuxes are to be used for reading from registers to the A/B bus.
  • readRPi to specify that the RPi is to provide data for the A bus.
  • readALU to specify that the ALU is to provide data for the A bus.
  • writeA to specify that data from the A bus is to be written to the register provided with selectA.
  • writeRPi to specify that data from the A bus is to be written to the RPi.
  • writeALU to specify that both bus A and bus B are to be used to write to the two ALU inputs.
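
One way to picture how these lines share bus A is the Python sketch below. The function names, the dictionary of control lines and, in particular, the contention check are assumptions for illustration; the description above doesn't say what the hardware does if two drivers are enabled at once.

```python
# Sketch only. The contention check is an assumption, not a feature of the
# design described above.

def drive_bus_a(ctrl, regs, select_a, alu_out, rpi_data):
    """Return the value placed on bus A, or None if nothing drives it."""
    drivers = []
    if ctrl.get("readA"):
        drivers.append(regs[select_a])   # register chosen by the selectA demux
    if ctrl.get("readRPi"):
        drivers.append(rpi_data)
    if ctrl.get("readALU"):
        drivers.append(alu_out)          # the ALU's output register
    if len(drivers) > 1:
        raise ValueError("more than one component is driving bus A")
    return drivers[0] if drivers else None

def latch_from_bus_a(ctrl, bus_a, regs, select_a):
    """Apply writeA: store the bus A value in the selected register."""
    if ctrl.get("writeA") and bus_a is not None:
        regs[select_a] = bus_a

regs = [0] * 8               # eight registers, matching the 3:8 demux
regs[3] = 42
value = drive_bus_a({"readA": 1}, regs, 3, alu_out=0, rpi_data=0)

# Copy the ALU output register into register 5.
ctrl = {"readALU": 1, "writeA": 1}
bus_a = drive_bus_a(ctrl, regs, 5, alu_out=99, rpi_data=0)
latch_from_bus_a(ctrl, bus_a, regs, 5)
print(value, regs[5])
```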
The ALU contains a standard 16 bit register on its output to keep the result, as it's slow and isn't connected to a bus at the time the output is created (so that you can execute instructions like P = *P). The ALU output is just like a normal register, but there is a longer data path to get the information in. This should look good when single stepping: the selectA and selectB LEDs will show which registers were selected, from which the ALU inputs may be read off, and the output buffer will show the result.

There are three sorts of instruction:

 ALU Rz = Rx op Ry
 READ Rz = *Rx, Rx +-= Ry
 WRITE *Rx = Rz, Rx +-= Ry
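
The three forms can be sketched in Python as below. This is only my reading of the notation: the function names, the sign argument standing for "+-=", the dict used as memory and the 16 bit wrap-around are assumptions, and the order of the updates when Rz and Rx are the same register is a guess.

```python
import operator

MASK = 0xFFFF  # registers and buses are 16 bits wide

def alu_instr(regs, z, x, y, op):
    """ALU: Rz = Rx op Ry."""
    regs[z] = op(regs[x], regs[y]) & MASK

def read_instr(regs, mem, z, x, y, sign=1):
    """READ: Rz = *Rx, Rx +-= Ry."""
    addr = regs[x]                        # address is taken before the update
    regs[x] = (addr + sign * regs[y]) & MASK
    regs[z] = mem.get(addr, 0)            # loaded last, so z == x keeps the load

def write_instr(regs, mem, z, x, y, sign=1):
    """WRITE: *Rx = Rz, Rx +-= Ry."""
    addr = regs[x]
    regs[x] = (addr + sign * regs[y]) & MASK
    mem[addr] = regs[z]                   # assumption: stored after Rx updates

regs = [0] * 8
regs[1], regs[2] = 10, 1                  # R1 is a pointer, R2 holds the step
mem = {10: 99}

read_instr(regs, mem, 0, 1, 2)            # R0 = *R1, R1 += R2
write_instr(regs, mem, 0, 1, 2)           # *R1 = R0, R1 += R2
alu_instr(regs, 3, 1, 2, operator.add)    # R3 = R1 + R2
print(regs[:4], mem)
```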
The RPi has as input:
  • readRPi, writeRPi
  • 16 bits from busA (duplex)
The RPi has as output:
  • reset
  • clock0, clock1 and clock2.
  • 16 bits to busA (duplex)
That's 22 bits of IO in total as the 16 bits of data are both input and output (a forced decision due to the RPi3 hardware having 26 GPIO pins).
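
The pin budget can be checked with a few lines of Python. The dictionary is just the list above written out; the 26 is the GPIO figure quoted in the text, and the "separate buses" case is a hypothetical alternative included only to show why sharing the data lines is forced.

```python
# Lines between the RPi and the nanoCPU, as listed above.
rpi_lines = {
    "readRPi": 1,
    "writeRPi": 1,
    "busA (duplex)": 16,
    "reset": 1,
    "clock0": 1,
    "clock1": 1,
    "clock2": 1,
}
GPIO_AVAILABLE = 26                    # figure quoted in the text for the RPi3

total = sum(rpi_lines.values())
spare = GPIO_AVAILABLE - total
separate_buses = total + 16            # hypothetical: separate in and out data
print(total, spare, separate_buses > GPIO_AVAILABLE)
```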

For each "instruction", i.e. for both odd (real/conventional) and even (I = *P++, fetch) instruction cycles the state machine goes through:
  • clock0: selectA = Rx, selectB = Ry, readA = 1, readB = 1, writeA = 0, GPUcmd set (op or +-), writeALU = 1, writeRPi = 1 # Rx and Ry set up for ALU; Rx and readRPi/writeRPi read by RPi
  • clock1: readA = 0, readB = 0, selectA = set (Rx or Rz), writeA = 1, writeALU = 1, readALU = 1 # result written to register
  • clock2 (skip if ALU): selectA = Rz, set writeA or readA from READ/WRITE type (following clock0 readRPi/writeRPi). RPi already knows to read/write from clock0.
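
As a sketch of how selectA moves through the clocks, the Python below lists the value for each phase. It rests on my reading of "set (Rx or Rz)" at clock1: Rz for an ALU instruction (the result register) and Rx for READ/WRITE (the updated pointer). The function name and the tuple format are made up for the example.

```python
# Sketch only: selectA per clock phase, under the assumptions stated above.

def select_a_schedule(kind, rx, rz):
    """Return [(clock, selectA)] for one instruction of the given kind."""
    if kind == "ALU":
        # clock2 is skipped: the result goes straight to Rz at clock1.
        return [("clock0", rx), ("clock1", rz)]
    if kind in ("READ", "WRITE"):
        # clock1 writes the updated pointer back to Rx, clock2 moves Rz.
        return [("clock0", rx), ("clock1", rx), ("clock2", rz)]
    raise ValueError("unknown instruction kind: " + kind)

print(select_a_schedule("ALU", 1, 3))
print(select_a_schedule("READ", 1, 3))
```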
FINISH: This doesn't consider conditional statements, nor the need to inc P when not executed. However, as it stands, there's not too much that is conditional here (three things: where GPUcmd comes from, readRPi/writeRPi and clock1's selectA). This is a good start; I wanted the instruction decode and execution to be as clean as possible.

RESET:   At t=0, instead of loading I=*P++ the instruction P=0 is loaded which can then jump to where it needs to be, set the carry flag and patch up the bottom few words if needs be.  The memory will already be initialised.

It seems common to design everything in terms of logic gates, each of which has a fixed design in terms of transistors. I don't always have the logic gate level in my architecture; I drop to the transistor level when I need speed (e.g. the ripple carry) or to minimise component count (the memory cell). I make no apology for this: I'm designing something that is educational at all levels.

I was expecting the layout to look something like the below. I haven't decided where the RPi will be yet, probably at the top.

If I can build each vertical bit slice of memory and ALU as a component, then a failed slice could be replaced easily. I expect these to take up most of the space and so be the main things that fail, but I have very little idea of the failure rate right now.

The layout is designed so that, with quite a lot of imagination, you can play 5 x 16 pong (2 bit bat at each end, 1 bit ball at 0, -45 and +45 ball angles) or 16 x 5 breakout, with a 3 bit bat and 2 lines to delete. It's all very tight. I did think of moving P down and using the upper bits of it as part of the display (duplicating the code for all upper bit patterns), but that's going a little bit too far....