Archive for September, 2008

CCL on x8632: more registers, please

Friday, September 19th, 2008

The biggest problem with porting CCL to the x8632 architecture is that the architecture has so few registers. I need to talk a little bit about how CCL’s GC works in order to explain why the small number of registers is a problem.

CCL is designed to use pre-emptively scheduled OS threads, so a context switch can happen at any instruction boundary. Since other threads might allocate memory, a GC can happen between any two instructions, too.

(The following summarizes information found in the Implementation Details of Clozure CL chapter of the CCL manual.)

The GC in CCL is precise. That is, it believes that it can always tell whether a register or a stack location contains a lisp value or just raw bits. In order to enable it to do this, we have to adopt and follow strict conventions on register and stack usage.

What we do is to partition the machine registers into two sets: one that will always contain raw, unboxed values (“immediates”), and another that will always contain tagged lisp objects (“nodes”).

This works fine on architectures that have a reasonable number of registers. On x8632, we’re so register-starved that it’s impossible to get by with a static partitioning of registers.

We therefore keep a bit mask in thread-local memory that indicates whether a given register is a node or an immediate, and have the GC consult these bits when it runs. This allows us to switch the class of a register at run time.

The default register partitioning looks like this:

  • We have a single “immediate” register. EAX is given the symbolic name %imm0.
  • There are two “dedicated” registers. ESP and EBP have dedicated functionality dictated by the hardware and calling conventions.
  • The remaining 5 registers are “node” registers (%temp0, %temp1, %arg_y, %arg_z, and %fn). We don’t use the x86 string instructions, which implicitly use ESI and EDI, so those two registers are available for general use.
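
To make the mask idea concrete, here is a minimal Python sketch of how a collector might use such a bit mask to decide which registers to scan. The bit layout (one bit per register, numbered by the standard x86 register encoding) and the names `REGS`, `bit`, `DEFAULT_NODE_MASK`, and `node_registers` are assumptions made for illustration; this is not CCL's actual code.

```python
# The eight 32-bit GPRs in standard x86 encoding order.
# Using the encoding number as the bit position is an assumption.
REGS = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]

def bit(reg):
    """Mask bit for a register (hypothetical layout: bit n for register n)."""
    return 1 << REGS.index(reg)

# Default partition from the list above: EAX is the immediate register,
# ESP and EBP are dedicated, and the remaining five registers hold nodes.
DEFAULT_NODE_MASK = (bit("ecx") | bit("edx") | bit("ebx")
                     | bit("esi") | bit("edi"))

def node_registers(mask):
    """Registers a precise GC would treat as tagged lisp objects."""
    return [r for r in REGS if mask & bit(r)]

print(node_registers(DEFAULT_NODE_MASK))
```

Under this layout, the registers left out of the result are exactly the ones whose contents the GC must never interpret as lisp objects.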

Most of the time, all we need to do is to steal a node register and mark it as an immediate for a couple of instructions. Typically this is because we need to index some foreign pointer, or use MUL or DIV to produce extended-precision results.

Here’s an example of a case where we have to do this.

(defx8632lapfunction %%get-unsigned-longlong ((ptr arg_y) (offset arg_z))
  (trap-unless-typecode= ptr x8632::subtag-macptr)
  (mark-as-imm temp0)
  (let ((imm1 temp0))
    (macptr-ptr ptr imm1)
    (unbox-fixnum offset imm0)
    (movq (@ (% imm1) (% imm0)) (% mm0)))
  (mark-as-node temp0)
  (jmp-subprim .SPmakeu64))

The mark-as-imm macro expands to something like this:

(andb ($ (lognot bit-for-temp0)) (@ (% :rcontext) x8632::tcr.node-regs-mask))

Here, :rcontext is the register that points to a block of thread-local storage (the thread context record). On x8632, it’s an otherwise useless segment register, typically %fs. (We’d be in real trouble if we had to dedicate a GPR to point to thread-local storage on x8632.)
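
The effect of a mark-as-imm/mark-as-node pair on the mask can be modelled in a few lines of Python. This is only a sketch. The `Tcr` class, the bit position chosen for temp0, and the reading that a set bit means "node" (inferred from the field name node-regs-mask) are assumptions for illustration.

```python
TEMP0_BIT = 1 << 1  # hypothetical bit position for temp0 (ECX)

class Tcr:
    """Stand-in for the thread context record; only the mask is modelled."""
    def __init__(self, node_regs_mask):
        self.node_regs_mask = node_regs_mask

def mark_as_imm(tcr, reg_bit):
    # Clear the register's bit (AND with the complement), leaving the
    # other bits alone, so the GC ignores whatever raw bits it holds.
    tcr.node_regs_mask &= ~reg_bit & 0xFF

def mark_as_node(tcr, reg_bit):
    # Set the bit again; from here on the register must hold a lisp object.
    tcr.node_regs_mask |= reg_bit

tcr = Tcr(0b11001110)          # hypothetical default: five node registers
mark_as_imm(tcr, TEMP0_BIT)
during = tcr.node_regs_mask    # temp0 is invisible to the GC here
mark_as_node(tcr, TEMP0_BIT)
after = tcr.node_regs_mask     # back to the default partition
print(bin(during), bin(after))
```

Because each change is a single read-modify-write of the mask, a GC that runs between any two instructions sees the register as either a node or an immediate, never something in between.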

In simple cases like this, there’s actually another alternative. We don’t use the x86 string instructions, so the direction flag (DF) in EFLAGS is otherwise unused. We therefore adopt the convention that if DF is set, %edx is an immediate register. So, if we used temp1 (aka EDX) instead of temp0 (aka ECX) in the example above, we could replace the mark-as-imm/mark-as-node pair with the (presumably cheaper) std/cld instruction pair.

In fact, I should probably make that change…
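
The direction-flag convention amounts to one extra condition when the collector classifies EDX. A rough Python sketch follows. The function name `edx_is_node` and the bit position are hypothetical, and the way the flag combines with the mask bit is an assumption rather than something taken from the CCL sources.

```python
EDX_BIT = 1 << 2  # hypothetical bit position for temp1 (EDX)

def edx_is_node(node_regs_mask, direction_flag):
    """EDX is scanned as a lisp object only if its mask bit is set
    and DF is clear (assumed combination of the two conventions)."""
    return bool(node_regs_mask & EDX_BIT) and not direction_flag

mask = 0b11001110                 # hypothetical default node mask
print(edx_is_node(mask, False))   # DF clear, as after cld
print(edx_is_node(mask, True))    # DF set, as after std
```

In this model, std plays the role of mark-as-imm for EDX and cld plays the role of mark-as-node, with no memory write to the thread context record.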

Anyway, many’s the time I wished for just two more registers. I thought about sending a bug report to Intel, but I didn’t figure that I’d get a response.

Clozure CL 1.2 released

Thursday, September 18th, 2008

Clozure CL 1.2 is out now.  It runs on x86-64 and PowerPC processors, under Mac OS X, Linux, and FreeBSD.  (I continue to be surprised by how many people think it runs only on Macintosh systems.)  This is the first official release in over two and a half years.

Highlights include a fast compiler, native threads, a convenient FFI, Unicode support, and a generational GC.  See http://trac.clozure.com/openmcl

One major feature that will be in Clozure CL 1.3 is support for the 32-bit x86 platform.  In fact, an experimental 32-bit lisp is already in the trunk for Darwin/x86.  I worked on the 32-bit Intel port.

It’s probably a little unusual for software to be ported from x86-64 back to x8632.  Anti-progress, as it were.

The existence of an x8664 port made the job quite a bit simpler: one major benefit was that there was already a working assembler (and disassembler, too).  I was also able to use the existing low-level x86-64 assembly language code as a model for what the corresponding 32-bit version should look like.

Another thing I had going for me was that the lisp already ran on the 32-bit PowerPC, so the word size issues were mostly ironed out.

I didn’t really know (or care to know) the x86 architecture all that well before I started working on the port.  I think other architectures (SPARC, MIPS, PowerPC, …) are much nicer targets.

However, the hardware engineers at Intel and AMD are brilliant, and it’s impossible to ignore the performance of the x86 chips that they build.  You just have to hold your nose, study the architecture manuals, and get on with it.

After doing the port, I find it funny that I look on x86-64 as some sort of Nirvana.  (I mentioned this on a private IRC channel, and got the reply “It’s not THAT bad.  Of the 8 or so architectures that I can think of that’re still in use, it’s in the top 7.”)

I’m afraid that it might be a bit boring to read about issues that face the lisp implementer when targeting x8632, but maybe I’ll write a follow-on post with some more details if there’s any interest.