Monday, March 10, 2008

Episode 2: Poking in Compiler Guts

In the first episode, we introduced the Parrot Compiler Tools, and generated a very simple language using a shell script provided with the Parrot distribution. We also announced Squaak, a simple programming language developed specially for this tutorial. Squaak will be the case study to show how PCT can be used as a very effective set of tools to implement a language for Parrot. A list of features of Squaak was specified. If you felt lucky, you might even have tried to do the exercise at the end of the previous episode.

In this episode, we will take a closer look at the generated compiler. We shall check out the different stages of the compilation process, and show what's going on in PCT-based compilers.

Under the Hood
Remember how we invoked our compiler in the previous episode? We can pass a file, or invoke the compiler without a command line argument, in which case our compiler enters the interactive mode. Consider the first case, passing the file test.sq, just as we did before:
$ ../../parrot squaak.pbc test.sq
When invoking our compiler like this, the file test.sq is compiled and the generated code (bytecode) is executed immediately by Parrot. How does this work, you might wonder. The interpretation of a script is done through a series of transformations, starting at the script source and ending in a format that can be executed by Parrot. Compilers built with the PCT (based on the HLLCompiler class) can take a target option, to show one of the intermediate representations. This option can have the following values, corresponding to the four default compilation phases of an HLLCompiler object:
  • --target=parse
  • --target=past
  • --target=post
  • --target=pir
This is an example of using the target option set to "parse", which will print the parse tree of the input to stdout:
$ ../../parrot squaak.pbc --target=parse test.sq
In interactive mode, giving this input:
say 42;
will print this parse tree (without the line numbers):
 1 "parse" => PMC 'Squaak::Grammar' => "say 42;\r\n" @ 0 {
2 <statement> => ResizablePMCArray (size:1) [
3 PMC 'Squaak::Grammar' => "say 42;\r\n" @ 0 {
4 <value> => ResizablePMCArray (size:1) [
5 PMC 'Squaak::Grammar' => "42" @ 4 {
6 <integer> => PMC 'Squaak::Grammar' => "42" @ 4
7 }
8 ]
9 }
10 ]

11 }
When changing the value of the target option, the output changes into a different representation of the input. Why don't you try that right now?

So, a HLLCompiler object has four compilation phases: parsing, construction of a Parrot Abstract Syntax Tree (PAST), construction of a Parrot Opcode Syntax Tree (POST), generation of Parrot Intermediate Representation (PIR). After compilation, the generated PIR is executed immediately.

If your compiler needs additional stages, you can add them to your HLLCompiler object. For Squaak, we will not need this, but for details, check out compilers/pct/src/pct/HLLCompiler.pir.

We shall now discuss each compilation phase in more detail. The first two phases, parsing the input and construction of the PAST are executed simultaneously. Therefore, these are discussed together.

Parse phase: match objects and PAST construction
During the parsing phase, the input is analyzed using Perl 6's extended regular expressions, known as Rules (see Synopsis 5 for details). When a rule matches some input string, a so-called Match object is created. A Match object is a combined array and hashtable, implying it can be indexed by integers as well as strings. As rules typically consist of other (sub) rules, it is easy to retrieve a certain part of the match. For instance, this rule:
rule if_statement {
'if' <expression> 'then' <statement> 'end'
{*}
}
has two other subrules: expression and statement. The match object for the rule if_statement represents the whole string from if to end. When you're interested only in the expression or statement part, you can retrieve that by indexing the match object by the name of the subrule (in this case, expression and statement, respectively).

During the parse phase, the PAST is constructed. There is a small set of PAST node types, for instance, PAST::Var to represent variables (identifiers, such as print), PAST::Val to represent literal values (for instance, "hello" and 42), and so on. Later we shall discuss the various PAST nodes in more detail.

Now, you might wonder, at which point exactly is this PAST construction happening? This is where the special {*} symbol comes in, just below the string 'if' in the if_statement rule shown above. These special markers indicate that a parse action should be invoked. Such a parse action is just a method that has the same name as the rule in which it is written (in this case: if_statement). So, during the parsing phase, several parse actions are executed, each of which builds a piece of the total PAST representing the input string. More on this will be explained later.

The Parrot Abstract Syntax Tree is just a different representation of the same input string (your program being compiled). It is a convenient data structure to transform into something different (such as executable Parrot code) but also to do all sorts of analysis, such as compile-time type checking.

PAST to POST
After the parse phase during which the PAST is constructed, the HLLCompiler transforms this PAST into something called a Parrot Opcode Syntax Tree (POST). The POST representation is also a tree structure, but these nodes are on a lower abstraction level. For instance, on the PAST level there is a node type to represent a while statement (constructed as PAST::Op.new( :pasttype('while') )). The template for a while statement typically consists of a number of labels and jump instructions. On the POST level, the same while statement is represented by a set of nodes, each representing a one instruction or a label. Therefore, it is much easier to transform a POST into something executable than when this is done from the PAST level.
Usually, as a user of the PCT, you don't need to know details of POST nodes, which is why this will not be discussed in further detail. Use the target option to see what a POST looks like.

POST to PIR
In the fourth (and final) stage, the POST is transformed into Parrot Intermediate Representation (PIR). As mentioned, transforming a POST into something executable is rather straightforward, as POST nodes already represent individual instructions and labels. Again, normal usage of the PCT does not require you to know any details about this transformation.

And now for the good news...
We established the general data flow of PCT-based compilers, which consists of four stages:
  1. source to parse tree
  2. parse tree to PAST
  3. PAST to POST
  4. POST to PIR
where we noted that the first two are done during the parse stage. Now, as you're reading this tutorial, you're probably interested in using the PCT for implementing Your Favorite Language for Parrot. We already saw that a language grammar is expressed in Perl 6 Rules. What about the other transformations?
Well, earlier in this episode we mentioned the term parse actions, and that these actions create PAST nodes. After you have written a parse action for each grammar rule, you're done!

Say what?

That's right. Once you have correctly constructed a PAST, your compiler can generate executable PIR, which means you just implemented your first language for Parrot. Of course, you still need to implement any language specific libraries, but that's besides the point.

PCT-based compilers already know how to transform a PAST into a POST, and how to transform a POST into PIR. These transformation stages are already provided by the PCT.

What's next?
In this episode we took a closer look at the internals of a PCT-based compiler. We discussed the four compilation stages, that transform an input string (a program, or script, depending on your definition) into a PAST, a POST and finally executable PIR.
The next episodes is where the Fun Stuff is: we will be implementing Squaak for Parrot. Piece by piece, we will implement the parser and the parse actions. Finally, we shall demonstrate John Conway's "Game of Life" running on Parrot, implemented in Squaak.

Exercises
Last episode's exercise was to add a command line banner and prompt for the interactive mode of our compiler. Given the hints that were provided, it was probably not too hard to find the solution, which is shown below. This subroutine onload can be found in the file Squaak.pir. The relevant lines are printed in bold.
.sub 'onload' :anon :load :init
load_bytecode 'PCT.pbc'

$P0 = get_hll_global ['PCT'], 'HLLCompiler'
$P1 = $P0.'new'()
$P1.'language'('Squaak')
$P1.'parsegrammar'('Squaak::Grammar')
$P1.'parseactions'('Squaak::Grammar::Actions')

$P1.'commandline_banner'("Squaak for Parrot VM\n")
$P1.'commandline_prompt'('> ')

.end
Starting in the next episode, the exercises will be more interesting. For now, it would be useful to browse around through the source files of the compiler, and see if you understand the relation between the grammar rules in grammar.pg and the methods in actions.pm.
It's also useful to experiment with the --target option described in this episode. If you don't know PIR, now is the time to do some preparation for that. There's sufficient information to be found on PIR, see the References section for details. In the mean time, if you have any suggestions, questions and whatnot, don't hesitate to leave a comment.

References
  1. PIR language specification: docs/pdds/draft/PDD19_pir.pod
  2. PIR articles: docs/art/*.pod
License
The source code in this tutorial has been released by the author into the public domain. Where this is not possible by law, the author grants license to use this file for any reason without any rights reserved, and with no warranty express or implied or fitness for a particular purpose.

6 comments:

Proclus said...

this is really great!! I look forward to seeing more examples! I will join the fun soon I hope!

arcady said...

Nice series so far, I'm really looking forward to seeing the rest. One issue though: I know we can go and look up PIR in the docs, but it'd be nice to have a quick overview here anyway, just a short explanation of what PIR is and the sorts of things you can do with it, from the point of view of code that you'd see in the output of a compiler or in language builtins.

kjs said...

I'm a bit hesitant to write more on PIR, as there's already a lot of docs to be found on that. However, what I can do is to explain PIR a bit more inline with the rest of the tutorial (whenever we write PIR, which is not a lot).

ChrisDolan said...

When I set

$P1.'commandline_prompt'('> ')

then Parrot prompts with "> > ". It seems that HLLCompiler prints a "> " after the prompt no matter what. Is this a bug, or did I misunderstand?

kjs said...

Chris,

that seems odd. What happens if you remove the line to set the prompt?
Did you check out a fresh parrot? Could you try that?

Someone sent a patch the the mailing list to make "> " a default prompt, but I don't think it was applied (and IMHO, it shouldn't).

You can also mail me, if you need detailed help.

Joeri said...

I can confirm what chrisdolan said with parrot-0.6.2
It also showed "> " by default, before I changed anything to squaak.pir

The culprit seems to be in compilers/pct/src/PCT/HLLCompiler.pir
(notice that the second occurence of PCT is uppercase, which apparently changed since you wrote this post)
line 502 reads:
code = stdin.'readline'('> ')

which probably would print a prompt in all interactive cases