CCL on x8632: more registers, please

The biggest problem with porting CCL to the x8632 architecture is that it has so few registers: only eight general-purpose ones, and two of those are already spoken for. I need to talk a little bit about how CCL’s GC works in order to explain why the small number of registers is a problem.

CCL is designed to use pre-emptively scheduled OS threads, so a context switch can happen at any instruction boundary. Since other threads might allocate memory, a GC can happen between any two instructions, too.

(The following summarizes information found in the Implementation Details of Clozure CL chapter of the CCL manual.)

The GC in CCL is precise. That is, it believes that it can always tell whether a register or a stack location contains a lisp value or just raw bits. In order to enable it to do this, we have to adopt and follow strict conventions on register and stack usage.

What we do is to partition the machine registers into two sets: one that will always contain raw, unboxed values (“immediates”), and another that will always contain tagged lisp objects (“nodes”).

This works fine on architectures that have a reasonable number of registers. On x8632, we’re so register-starved that it’s impossible to get by with a static partitioning of registers.

We therefore keep a bit mask in thread-local memory that indicates whether a given register is a node or an immediate, and have the GC consult these bits when it runs. This allows us to switch the class of a register at run time.
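To make the idea concrete, here is a rough C model of such a mask. It is only a sketch, not CCL's actual code: the struct, the function names, and the choice of one bit per register in hardware encoding order are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical numbering: the standard x86 register encoding order. */
enum reg { EAX, ECX, EDX, EBX, ESP, EBP, ESI, EDI };

/* Bit set = node register. By default every GPR except EAX (the
   immediate) and ESP/EBP (dedicated) is a node register. */
#define DEFAULT_NODE_MASK \
    ((uint8_t)((1u << ECX) | (1u << EDX) | (1u << EBX) | (1u << ESI) | (1u << EDI)))

/* Stand-in for the per-thread context record (an assumption). */
struct tcr { uint8_t node_regs_mask; };

/* Clear the register's bit: the GC will now ignore its contents. */
void mark_as_imm(struct tcr *t, enum reg r) {
    t->node_regs_mask &= (uint8_t)~(1u << r);
}

/* Set the register's bit: the GC will treat it as a lisp object again. */
void mark_as_node(struct tcr *t, enum reg r) {
    t->node_regs_mask |= (uint8_t)(1u << r);
}

/* What the GC would ask about each register when it runs. */
bool gc_treats_as_node(const struct tcr *t, enum reg r) {
    return (t->node_regs_mask >> r) & 1u;
}

/* Borrow ECX as a scratch immediate, then hand it back. */
void demo(void) {
    struct tcr t = { DEFAULT_NODE_MASK };
    mark_as_imm(&t, ECX);
    assert(!gc_treats_as_node(&t, ECX));
    mark_as_node(&t, ECX);
    assert(t.node_regs_mask == DEFAULT_NODE_MASK);
}
```

Because the mask lives in thread-local memory, each thread can reclassify its own registers without affecting any other thread.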

The default register partitioning looks like this:

We have a single “immediate” register. EAX is given the symbolic name %imm0.
There are two “dedicated” registers. ESP and EBP have dedicated functionality dictated by the hardware and calling conventions.
The remaining 5 registers are “node” registers (%temp0, %temp1, %arg_y, %arg_z, and %fn). We don’t use the x86 string instructions, which implicitly use ESI and EDI, so those two registers are free for general use.

Most of the time, all we need to do is to steal a node register and mark it as an immediate for a couple of instructions. Typically this is because we need to index some foreign pointer, or use MUL or DIV to produce extended-precision results.

Here’s an example of a case where we have to do this.

(defx8632lapfunction %%get-unsigned-longlong ((ptr arg_y) (offset arg_z))
  (trap-unless-typecode= ptr x8632::subtag-macptr)
  (mark-as-imm temp0)
  (let ((imm1 temp0))
    (macptr-ptr ptr imm1)
    (unbox-fixnum offset imm0)
    (movq (@ (% imm1) (% imm0)) (% mm0)))
  (mark-as-node temp0)
  (jmp-subprim .SPmakeu64))
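For readers who don't speak LAP, here is roughly what that function computes, written as a C sketch. The names are hypothetical, and the fixnum representation (integer shifted left by two bits, zero tag) is an assumption made for the example. The point is that both the raw pointer and the unboxed offset are plain bits, not lisp objects, which is why two immediate registers are needed at once.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assumption: a fixnum is the integer shifted left by two zero tag bits. */
#define FIXNUM_SHIFT 2

/* Strip the tag bits; exact because the low bits of a fixnum are zero. */
intptr_t unbox_fixnum(intptr_t boxed) {
    return boxed / (1 << FIXNUM_SHIFT);
}

/* Read 64 raw bits at raw_ptr + offset (the movq in the LAP code).
   Turning the result into a lisp integer is left out here. */
uint64_t get_unsigned_longlong(const void *raw_ptr, intptr_t boxed_offset) {
    uint64_t value;
    memcpy(&value,
           (const unsigned char *)raw_ptr + unbox_fixnum(boxed_offset),
           sizeof value);
    return value;
}

/* Small self-check: store a value at byte offset 8 and read it back. */
void demo(void) {
    unsigned char buf[16] = {0};
    uint64_t x = 0x1122334455667788ULL;
    memcpy(buf + 8, &x, sizeof x);
    assert(get_unsigned_longlong(buf, 8 << FIXNUM_SHIFT) == x);
}
```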

The mark-as-imm macro expands to something like this:

(andb ($ (lognot bit-for-temp0)) (@ (% :rcontext) x8632::tcr.node-regs-mask))

Here, :rcontext is the register that points to a block of thread-local storage (the thread context record). On x8632, it’s an otherwise useless segment register, typically %fs. (We’d be in real trouble if we had to dedicate a GPR to point to thread-local storage on x8632.)

In simple cases like this, there’s actually another alternative. We don’t use the x86 string instructions, so the direction flag (DF) in EFLAGS is otherwise unused. We therefore adopt the convention that if DF is set, then EDX is an immediate register. If we used temp1 (aka EDX) instead of temp0 (aka ECX) in the example above, we could replace the mark-as-imm/mark-as-node pair with the (presumably cheaper) std/cld instruction pair.

In fact, I should probably make that change…
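Here is how the GC's side of the DF convention might look, again as a hedged C model rather than anything taken from CCL. The struct, the function names, and the mask layout are hypothetical; the only hard fact used is that DF is bit 10 of EFLAGS.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical numbering: the standard x86 register encoding order. */
enum reg { EAX, ECX, EDX, EBX, ESP, EBP, ESI, EDI };

#define EFLAGS_DF (1u << 10)  /* direction flag is bit 10 of EFLAGS */

/* Default node registers: everything except EAX, ESP, and EBP. */
#define DEFAULT_NODE_MASK \
    ((uint8_t)((1u << ECX) | (1u << EDX) | (1u << EBX) | (1u << ESI) | (1u << EDI)))

/* Stand-in for the state the GC sees for a suspended thread. */
struct thread_state {
    uint8_t  node_regs_mask;
    uint32_t eflags;
};

/* DF set overrides the mask for EDX only; other registers use the mask. */
bool gc_treats_as_node(const struct thread_state *s, enum reg r) {
    if (r == EDX && (s->eflags & EFLAGS_DF))
        return false;
    return (s->node_regs_mask >> r) & 1u;
}

/* Models of the std and cld instructions acting on the saved flags. */
void model_std(struct thread_state *s) { s->eflags |= EFLAGS_DF; }
void model_cld(struct thread_state *s) { s->eflags &= ~EFLAGS_DF; }

/* EDX is an immediate only between std and cld; the mask never changes. */
void demo(void) {
    struct thread_state s = { DEFAULT_NODE_MASK, 0 };
    model_std(&s);
    assert(!gc_treats_as_node(&s, EDX));
    model_cld(&s);
    assert(gc_treats_as_node(&s, EDX));
}
```

The appeal is that std and cld touch only a flag, whereas the mask version has to do a read-modify-write on memory each time.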

Anyway, many’s the time I wished for just two more registers. I thought about sending a bug report to Intel, but I didn’t figure that I’d get a response.