Easy Bytecode
Posted by Kaya Kupferschmidt • Wednesday, February 28. 2007 • Category: Programming
Writing a interpreter for a custom scripting language always seems to be more complex than writing a small bytecode compiler and bytecode interpreter. At a first glance, writing a direct interpreter might be easier, but if the scripting language contains flow control (like loops, if/else statements and similar constructs involving jumps), this turns out to be false. The primary problem lies in the fact that one needs to duplicate large parts of the parser - simply for skipping over the parts of a script that are not executed (like with a conditional if-block whose conditioon turns out to be false at runtime).
Because of this insight, I began to concentrate on writing an easy-to-implement bytecode compiler that would transform a text-based script into a more machine-friendly representation. A positive side-effect of bytecode is that simple fact that it is much faster to execute than the original textual representation. The downside of this approach is the fact that it involves writing a compiler - something that sounds to be a complex and difficult task.
But after analysing the process of parsing a script, I came to the conclusion that such a compiler and its corresponding bytecode interpreter ("virtual machine") would be rather straight-forward and easy to implement, if the underlying model of the virtual machine is chosen carefully. In my opinion the best machine model (in terms of simplicity to implement a compiler and interpreter) it a purely stack-based RPN (Reverse Polish notation) machine. And such a model not only is easy to implement, but it also easy to extract the original syntax tree from the bytecode, which in turn allows further optimisation techniques as a postprocessing step (it even wouldn't be to hard to turn bytecode into native assembler).
Reverse Polish Notation (or short RPN) is a special notation of mathematical operations, wherein every operator follows all of its operands (this is also called postfix notation). This means that instead of writing "3+5", the RPN representation would be "3 5 +".
Even more complex expressions can be easily transformed into the RPN notation, take for example "(2+5) (9-3*2)". This can be represented in RPN as "2 5 + 9 3 2 - *". There is a well-known algorithm (the Shunting yard algorithm) that converts any traditional infix-notation into an RPN expression. This also includes function calls and any sort of constant objects (like strings, numbers etc).
You can easily use the aforementioned Shunting yard algorithm to convert any expression into an RPN notation. In order to get bytecode, you only have to tag each element from the RPN stream corresponding to its type (ie is it an operation or an operand). This bytecode then has to be extended to support (conditional) jumps and some special operations like variable assignments etc.
My implementation does not use the shunting yard algorithm, instead I use a traditional recursive parser which directly can produce the RPN bytecode. Actually the fact that I already had a parser for a simpler interpreter made me think about RPN code, because it turns out to be much easier to produce RPN bytecode in contrast to code for a more advanced virtual machine with registers. Plus the RPN bytecode implicitly still contains the complete abstract syntax tree, which in turn opens the door for aggressive optimisation algorithms, or even a bytecode transformer that produces native assembler code.
The final interpretation of the RPN based bytecode is rather easy, you simply need a stack for all incoming value-type tokens, and then you can easily scan the RPN stream as follows:
Actually I have just implemented a basic bytecode compiler which does not use the shunting-yard algorithm to parse a script, but has a more traditional approach. But it does produce RPN bytecode and the bytecode executor simply is a big switch-statement (with different cases corresponding to different operators or value-types). You can access the scripting engine via SVN as part of the Magnum framework at svn://subversion.dimajix.de/magnum. I will try to package a smaller download version (or a new release of Magnum) soon.
As a reader indirectly pointed out, my system is similar to The FORTH Programming Language in that both use RPN notation (although my scripting engine only uses RPN internally an sticks to the standard notation as its input format) and two stacks (one for data and one for return addresses - the second stack currently is not explicitly implemented in Magnum, but my bytecode intepreter works recursively, so the return addresses are stored on the real CPU stack.)
Because of this insight, I began to concentrate on writing an easy-to-implement bytecode compiler that would transform a text-based script into a more machine-friendly representation. A positive side-effect of bytecode is that simple fact that it is much faster to execute than the original textual representation. The downside of this approach is the fact that it involves writing a compiler - something that sounds to be a complex and difficult task.
But after analysing the process of parsing a script, I came to the conclusion that such a compiler and its corresponding bytecode interpreter ("virtual machine") would be rather straight-forward and easy to implement, if the underlying model of the virtual machine is chosen carefully. In my opinion the best machine model (in terms of simplicity to implement a compiler and interpreter) it a purely stack-based RPN (Reverse Polish notation) machine. And such a model not only is easy to implement, but it also easy to extract the original syntax tree from the bytecode, which in turn allows further optimisation techniques as a postprocessing step (it even wouldn't be to hard to turn bytecode into native assembler).
What does RPN mean?
Reverse Polish Notation (or short RPN) is a special notation of mathematical operations, wherein every operator follows all of its operands (this is also called postfix notation). This means that instead of writing "3+5", the RPN representation would be "3 5 +".
Even more complex expressions can be easily transformed into the RPN notation, take for example "(2+5) (9-3*2)". This can be represented in RPN as "2 5 + 9 3 2 - *". There is a well-known algorithm (the Shunting yard algorithm) that converts any traditional infix-notation into an RPN expression. This also includes function calls and any sort of constant objects (like strings, numbers etc).
How to generate RPN bytecode from a procedural scripting language?
You can easily use the aforementioned Shunting yard algorithm to convert any expression into an RPN notation. In order to get bytecode, you only have to tag each element from the RPN stream corresponding to its type (ie is it an operation or an operand). This bytecode then has to be extended to support (conditional) jumps and some special operations like variable assignments etc.
My implementation does not use the shunting yard algorithm, instead I use a traditional recursive parser which directly can produce the RPN bytecode. Actually the fact that I already had a parser for a simpler interpreter made me think about RPN code, because it turns out to be much easier to produce RPN bytecode in contrast to code for a more advanced virtual machine with registers. Plus the RPN bytecode implicitly still contains the complete abstract syntax tree, which in turn opens the door for aggressive optimisation algorithms, or even a bytecode transformer that produces native assembler code.
How to interpret the final bytecode?
The final interpretation of the RPN based bytecode is rather easy, you simply need a stack for all incoming value-type tokens, and then you can easily scan the RPN stream as follows:
- Take token from stream.
- If token is a value, push it onto stack.
- If token is an operation, pop needed operands from stack, perform operation and finally push result onto stack.
Implementation?
Actually I have just implemented a basic bytecode compiler which does not use the shunting-yard algorithm to parse a script, but has a more traditional approach. But it does produce RPN bytecode and the bytecode executor simply is a big switch-statement (with different cases corresponding to different operators or value-types). You can access the scripting engine via SVN as part of the Magnum framework at svn://subversion.dimajix.de/magnum. I will try to package a smaller download version (or a new release of Magnum) soon.
Similar systems
As a reader indirectly pointed out, my system is similar to The FORTH Programming Language in that both use RPN notation (although my scripting engine only uses RPN internally an sticks to the standard notation as its input format) and two stacks (one for data and one for return addresses - the second stack currently is not explicitly implemented in Magnum, but my bytecode intepreter works recursively, so the return addresses are stored on the real CPU stack.)



4 Comments
Some time ago, I needed good (but easy) script language for my project, where I want users to specify operations on 2D geometric entities (to define measurement and evaluations of camera detected shapes). I planned to use some kind of RPN and FORTH (I called this GeoForth). But users seems confused, they did not like the RPN (probably never tried Texas Instruments calculators
But it would have to meet some special requirements, like the possibility of calling external functions via a simple string-lookup.
If I find some time (not very likely) I might investigate into a FORTH backend.
Add Comment