Easy Bytecode

Posted by Kaya Kupferschmidt • Wednesday, February 28. 2007 • Category: Programming
Writing a interpreter for a custom scripting language always seems to be more complex than writing a small bytecode compiler and bytecode interpreter. At a first glance, writing a direct interpreter might be easier, but if the scripting language contains flow control (like loops, if/else statements and similar constructs involving jumps), this turns out to be false. The primary problem lies in the fact that one needs to duplicate large parts of the parser - simply for skipping over the parts of a script that are not executed (like with a conditional if-block whose conditioon turns out to be false at runtime).

Because of this insight, I began to concentrate on writing an easy-to-implement bytecode compiler that would transform a text-based script into a more machine-friendly representation. A positive side-effect of bytecode is that simple fact that it is much faster to execute than the original textual representation. The downside of this approach is the fact that it involves writing a compiler - something that sounds to be a complex and difficult task.

But after analysing the process of parsing a script, I came to the conclusion that such a compiler and its corresponding bytecode interpreter ("virtual machine") would be rather straight-forward and easy to implement, if the underlying model of the virtual machine is chosen carefully. In my opinion the best machine model (in terms of simplicity to implement a compiler and interpreter) it a purely stack-based RPN (Reverse Polish notation) machine. And such a model not only is easy to implement, but it also easy to extract the original syntax tree from the bytecode, which in turn allows further optimisation techniques as a postprocessing step (it even wouldn't be to hard to turn bytecode into native assembler).

What does RPN mean?


Reverse Polish Notation (or short RPN) is a special notation of mathematical operations, wherein every operator follows all of its operands (this is also called postfix notation). This means that instead of writing "3+5", the RPN representation would be "3 5 +".

Even more complex expressions can be easily transformed into the RPN notation, take for example "(2+5) (9-3*2)". This can be represented in RPN as "2 5 + 9 3 2 - *". There is a well-known algorithm (the Shunting yard algorithm) that converts any traditional infix-notation into an RPN expression. This also includes function calls and any sort of constant objects (like strings, numbers etc).

How to generate RPN bytecode from a procedural scripting language?


You can easily use the aforementioned Shunting yard algorithm to convert any expression into an RPN notation. In order to get bytecode, you only have to tag each element from the RPN stream corresponding to its type (ie is it an operation or an operand). This bytecode then has to be extended to support (conditional) jumps and some special operations like variable assignments etc.
My implementation does not use the shunting yard algorithm, instead I use a traditional recursive parser which directly can produce the RPN bytecode. Actually the fact that I already had a parser for a simpler interpreter made me think about RPN code, because it turns out to be much easier to produce RPN bytecode in contrast to code for a more advanced virtual machine with registers. Plus the RPN bytecode implicitly still contains the complete abstract syntax tree, which in turn opens the door for aggressive optimisation algorithms, or even a bytecode transformer that produces native assembler code.

How to interpret the final bytecode?


The final interpretation of the RPN based bytecode is rather easy, you simply need a stack for all incoming value-type tokens, and then you can easily scan the RPN stream as follows:
  1. Take token from stream.

  2. If token is a value, push it onto stack.

  3. If token is an operation, pop needed operands from stack, perform operation and finally push result onto stack.
You only need to take care about the order of operands in step 3. for non-commutative operations.

Implementation?


Actually I have just implemented a basic bytecode compiler which does not use the shunting-yard algorithm to parse a script, but has a more traditional approach. But it does produce RPN bytecode and the bytecode executor simply is a big switch-statement (with different cases corresponding to different operators or value-types). You can access the scripting engine via SVN as part of the Magnum framework at svn://subversion.dimajix.de/magnum. I will try to package a smaller download version (or a new release of Magnum) soon.

Similar systems


As a reader indirectly pointed out, my system is similar to The FORTH Programming Language in that both use RPN notation (although my scripting engine only uses RPN internally an sticks to the standard notation as its input format) and two stacks (one for data and one for return addresses - the second stack currently is not explicitly implemented in Magnum, but my bytecode intepreter works recursively, so the return addresses are stored on the real CPU stack.)

4 Comments

Display comments as (Linear | Threaded)
  1. *Did you considered FORTH?

  2. *Thanks for the pointer to FORTH, indeed it looks similar to my approach, although my frontend does not require RPN notation. But as my frontend (script parser) is separated from the bytecode generator, it is possible to add a RPN based frontend. Maybe I will have a closer look at Forth.

  3. *I meant using FORTH "(byte)code" directly. FORTH could parse scripts into FORTH, and there are quality FORTH interpretors and compilers. But, because RPN is so easy to implement, I understand you want to go your own way.

    Some time ago, I needed good (but easy) script language for my project, where I want users to specify operations on 2D geometric entities (to define measurement and evaluations of camera detected shapes). I planned to use some kind of RPN and FORTH (I called this GeoForth). But users seems confused, they did not like the RPN (probably never tried Texas Instruments calculators ;-). I ended up with something modern, Prolog-like functional "programs", where users define new entities as evaluations of other entities, and the interpreter (actually, it is compiled into evaluation trees) will sort the evaluations out to provide user with all needed data.

  4. *Actually I didn't know that FORTH was using Bytecode before you were telling me so. But that would have been another possibility. As my bytecode is generated using an interface, it shouldn't be too hard to plug in a FORTH bytecode generator.

    But it would have to meet some special requirements, like the possibility of calling external functions via a simple string-lookup.

    If I find some time (not very likely) I might investigate into a FORTH backend.

Add Comment


Enclosing asterisks marks text as bold (*word*), underscore are made via _word_.
Standard emoticons like :-) and ;-) are converted to images.

To prevent automated Bots from commentspamming, please enter the string you see in the image below in the appropriate input box. Your comment will only be submitted if the strings match. Please ensure that your browser supports and accepts cookies, or your comment cannot be verified correctly.
CAPTCHA 1CAPTCHA 2CAPTCHA 3CAPTCHA 4CAPTCHA 5


Markdown format allowed



A Simple Sidebar