Friday, October 31, 2008

Register allocation in a PIR compiler

As some of you may know, I have been working on a new implementation of a PIR compiler, which is named PIRC. In the end, my plan is to emit actual Parrot Byte Code (PBC), which can then be run by Parrot. Once everything's working and tested properly, I will make a case for replacing the current PIR compiler (IMCC). It might take a while before I get to that, though.

In the mean time, I'm trying to make PIRC as feature-complete as possible, and will try to write up some of the implementation issues that have to be dealt with. This can be interesting for lurkers, who 'have always wanted to know how to do it', but also just as a way to document things.

In this article, I will discuss the implementation of a recently added feature: a register allocator. Now, the implementation I came up with still needs a lot of testing, but the idea behind it will not change. The algorithm I chose is Linear Scan Register (LSR) Allocation, which is described in detail here. This algorithm is very different from the more well-known (classic, if you like) register allocation algorithm, which is based on graph coloring theory. I won't go into details too much, except to mention that while a graph-coloring algorithm can yield more efficient register usage, it is a lot more expensive in terms of processor cycles than the LSR algorithm. Obviously, for dynamic languages, where runtime compilation is a common feature, this is an important advantage.

Register allocation
First, a short review of register allocation. I will make some rough simplifications just for the sake of this discussion, so take the details with a grain of salt. A real CPU has a limited number of registers. This is fine, as long as the number of registers is larger than the number of variables that are alive at the same time. However, once you have more live variables than registers, you need to overwrite a previously used register. If the value in the register being reused is still needed after this overwriting, then you need to make sure the value is stored somewhere temporarily, typically in memory. This is called register spilling. Then, later, when the variable that was 'spilled' is referenced again, it needs to be reloaded into a register, overwriting that register's current value, which may in turn need to be spilled.

In short, you typically have to map N variables onto M registers, and if N > M, you need to spill registers.

In the Parrot virtual machine, this is not necessary. Parrot allows a subroutine to use as many registers as it needs (up to the physical limitations of your memory, obviously). If a subroutine needs 50 registers, then Parrot will happily allocate those. After all, Parrot can be considered a CPU in software, which uses memory wherever a hardware CPU uses... eh... well, hardware. (Of course, memory chips are hardware too... oh well, you get my point.)

In any case, this is good news for any register allocator in the PIR compiler, as it makes the job of register allocation extremely simple. PIRC has a 'vanilla' register allocator, which allocates registers using a counter: it starts at 0, and increments the counter whenever it needs a new register.

However, it's still a bit of a waste. If you look at typical PIR output of Rakudo, the Perl 6 implementation running on Parrot, you'll see it's a lot of code. A lot. It would be good if registers could actually be reused once the variables they represent are never referenced again. After all, as you now know, Parrot allocates as many registers as a subroutine needs, and obviously, fewer is better, as it saves memory.

(Ok, I admit it, I really just did it because it was a nice challenge to implement).

Now that you know why register optimization is useful, let's discuss some basics of the algorithm that was implemented.

Linear Scan Register Allocation
The Linear Scan Register (LSR) Allocation algorithm is really quite simple. The basic idea is that each variable has a certain lifetime. In a programming language, a variable typically has a scope, which is the maximum life span of the variable (because it goes out of existence after its scope is closed. Are you still with me?). However, the fact that a variable has a scope doesn't mean it's being used throughout that whole scope; it's very possible, and likely, that a variable is only used in one or two statements. So, suppose there are two variables, say foo and bar, which have mutually exclusive life spans (this means they never live in the same 'era'). In that case, they can both be mapped to the same register. This is the fundamental idea behind the LSR allocator.

Data structures
First, I need to explain some basic data structures that are used in PIRC. I won't do this in too much detail; that would be a good topic for some other time.
PIRC basically has two important parts: the parser and the back-end. The parser is implemented as a Bison specification (I think it uses some features that are not Yacc-compatible). The back-end is a bunch of data structures that are manipulated during the parse.
In the context of register allocation, the most important data structures are the structs target, symbol and pir_reg. A symbol is a declared .local identifier; a pir_reg represents a PIR symbolic register (like $I42). A target represents a so-called l-value object; this is an object that can be assigned to. Basically, symbols and pir_reg objects have a one-to-one mapping w.r.t. the symbol or symbolic register they represent, while targets have a many-to-one relation w.r.t. the underlying symbol or pir_reg. So, suppose you declare a .local int foo, then there will be a single symbol object representing foo, but there may be many target nodes pointing to this symbol.
As symbols and pir_regs are only created once for the target they represent, the vanilla register allocator (which is a very basic, dumb allocator) will assign a new register to each new symbol/pir_reg object. Whenever a target is being parsed (and it's easy to recognize whether it's an identifier or a symbolic register), the symbol or pir_reg object for that target is looked up and linked to the target node.
This link is very important, because it is used to tell the symbol (from now on, wherever you read symbol, you can interpret it as 'symbol or pir_reg') that it is used in the current instruction. During the parse, the compiler keeps track of an instruction counter, which assigns consecutive numbers to the instructions. So, basically, whenever a target is parsed, it updates the life span of the symbol. The first usage of an identifier marks the first use of that symbol (and not its declaration: you may declare a symbol, but never use it).
This means that after the parse, each symbol (.local identifier or symbolic register) has a live interval object, which records the first and the last usage of that symbol.

Once you know when the variables are used, it becomes extremely simple to optimize the register usage. Basically, you iterate over the live interval objects, sorted by their start points; for each one, you assign the symbol it represents a new register (a real Parrot register like P2, not a symbolic register like $P2), and you place the live interval object into a list of "active" variables. On each iteration, you walk through this list and "expire" any live intervals whose symbols are no longer used. Whenever a live interval has expired, the register it was assigned becomes available again, so you put that register on a "free" list; the next time you need a new register, you first check whether any "old", previously used registers are available. If not, you just increment a counter, increasing the total number of registers that Parrot must allocate.

This all might sound very complex, but, really, it's not. The implementation is only about 200 lines of code or so (that's a wild guess), and can be found in compilers/pirc/new/pirregalloc.c.

Of course, the thing needs to be properly tested, but the basic idea stays the same. There's still quite some work to be done in PIRC anyway.

In this article I tried to explain the implementation of a register allocator, which optimizes the register usage of code generated by PIRC, a new PIR compiler. The basic data structures were briefly explained, and the algorithm was summarized. Experience will show whether the investment was worth it, or whether more heavy-weight implementations, such as a graph-coloring-based algorithm, should be used.


Dave said...

Hope you don't mind if I mention a couple of points. Presumably these "freed" registers are being spilled to memory, meaning that they are assigned to a stack location. I don't see how that can be a benefit if the virtual register is just a memory location that you are copying to another memory location and back. This algorithm would seem to be most beneficial if only applied to the first n virtual registers, where n is the number of hardware registers available for use in this way.

And as long as we are talking about registers, can I ask a related question? I've just been reading about Parrot for the first time and it looks like the call instructions are all monolithic. One of the advantages of a register-based virtual machine is that calls can be made more efficiently by passing the parameters in registers, but to do that, you have to decompose the call down into separate tasks of setting the argument registers and then branching to the target address. It looks like Parrot does not do that. Is that right?

Dave said...

Never mind. Something about this was nagging me, so I went back and read it again, and it looks like you are not spilling registers at all, but just reusing dead registers. So my comment doesn't apply.

That's the problem with assuming that you know what you are going to read before you read it. :-)