The processor needed a language that was higher level than assembler but was simple enough to understand in it's entirity (including the compiler) and yet with minimal overhead over assembly language.   I'm writing that language right now, it is called Third as it's a plausible prequel to Forth.

 Forth Third
 Is a compiler, editor, operating system and filing system is just the compiler
 Is purely stack based uses the stack for input and output of a function but takes static arguments from 


The other '80's language for micros is FORTH.   I always wanted to learn but somehow always had something better to do with my time (maybe the same should be true today).

Brad Rodriguez has written an excellent explanation of Design Decisions in the Forth Kernel which is well worth reading.

Implementing a direct threaded interpreter is quite efficient, but firstly some terminology.   A native function is one written in assembler, just a sequence of instructions to execute (ending in NEXT not RETURN, which is explained below).   A derived function starts with a small amount of native code (called ENTER) and is then followed by a number of addresses each of which is the address of another function (either native or derived).   Derived functions end in the address of EXIT.

Implementing this needs three special registers, which are called FP, SP and DS for now (later these will be mapped to real registers, e.g. C, D and E)
  • FP is the function pointer, it points to the address of the next function to be run.  It operates as an ascending list.
  • FS is the function stack, it keeps FP when another derived function is called.  Here it is implemented as an descending stack.
  • DS is the data stack, the location of the data which is popped and pushed on each function call.  Again it is implemented as an descending stack.
This leaves two registers, (probably A and B) available for general use.   It's tight but doable.   Long functions should store FP to get a register.

Forth has three bits of code which together are called the inner interpreter, these are NEXT, ENTER and EXIT.

NEXT is the code that ends every native function by jumping to the next one in the list.   You can think of it as a RET from this subroutine and a JSR to the next one.  It is always appended to every native function (it's too small to be worth jumping to).   This is very efficient, so provided programs are shallow (not too many nested calls until you get to native code) then this FORTH implemmntation should be very efficient (c.f. indirect threading).   As I don't intend to write huge programs, they should be shallow and efficient.
  • P = *FP++    # read the next function in the list and jump to it, setting up FP to point to the next function
ENTER (also called DOCOL or DOCOLON) is the native code that starts every derived function.   It sets up the interpreter to read function addresses from the memory after this function address (the code below is only three words so there's no point in calling a subroutine with the address of the code to run).
  • *FS-- = FP    # push FP onto the stack as we are starting a new function and don't want to lose it.
  • FP = P + 1    # set up FP for this function to point to immediately after this code
  • P = *FP++     # jump to the first of the function addresses in this derived function (NEXT)
EXIT is the last function in a derived function, it pops the context from a stack and calls NEXT on the layer below.
  • FS = FS + 1
  • FP = *FS  # pop FP from FS
  • P = *FP++    # execute the NEXT code
For example  ADD is a native function
  • A = *DS++      # pop the top of the data stack to A
  • B = *DS          # pop the top of the data stack to B
  • A = A + B       # do ALU function
  • *DS = A         # store the result on the top of the stack
  • P = *PF++     # NEXT
The instruction set has the limitation that instructions can only make one memory access.   So given that the operands are in main memory (and not in registers) then it is always going to take two reads, one write and one RETURN, so four instructions are the best that's possible and FORTH has little overhead here.

MUL is somewhat tedious, especially in the loop unrolled version:
  • A = *DP++     # first argument to MUL
  • B = *DP++     # second argument to MUL
  • *SP-- = C       # push FP onto the stack as we need the space
  • *SP   = D       # push DP onto the stack as we need the space
  • C = Z             # result - initialise to zero
  • D = U             # bit mask - initialise to 1
  • Z = A | D        # test lower bit
  • gt0 C = C + B # if lower bit set add in A (shifted)
  • B = B + B        # shift up bit mask
  • D = D + D       # shift up bit mask
  • Z = A | D         # test next lowest bit
  • gt0 C = C + B # if bit set add in A (shifted)
  • ...
  • gt0 C = C + B  # if top bit set add in A (shifted)
  • *--DP++ = A    # store the result on the top of the stack
  • D = *SP++      # reclaim DP
  • C = *SP          # reclaim FP
  • P = *FP++      # NEXT

It's going to be tight on registers for doing more complicated things.  Need to look at some of the FORTH implementations to see how much of a limitation this is.   The minimal transistor count constraint meant that there weren't going to be spare registers and so spilling to a data stack was inevitable.   It all looks quite efficient.

eForth seems to be a good minimal FORTH to get running first.   Then extend it using JONESFORTH.