Easy Bytecode
Because of this insight, I began to concentrate on writing an easy-to-implement bytecode compiler that would transform a text-based script into a more machine-friendly representation. A positive side-effect of bytecode is that simple fact that it is much faster to execute than the original textual representation. The downside of this approach is the fact that it involves writing a compiler - something that sounds to be a complex and difficult task.
But after analysing the process of parsing a script, I came to the conclusion that such a compiler and its corresponding bytecode interpreter ("virtual machine") would be rather straight-forward and easy to implement, if the underlying model of the virtual machine is chosen carefully. In my opinion the best machine model (in terms of simplicity to implement a compiler and interpreter) it a purely stack-based RPN (Reverse Polish notation) machine. And such a model not only is easy to implement, but it also easy to extract the original syntax tree from the bytecode, which in turn allows further optimisation techniques as a postprocessing step (it even wouldn't be to hard to turn bytecode into native assembler).
What does RPN mean?
Reverse Polish Notation (or short RPN) is a special notation of mathematical operations, wherein every operator follows all of its operands (this is also called postfix notation). This means that instead of writing "3+5", the RPN representation would be "3 5 +".
Even more complex expressions can be easily transformed into the RPN notation, take for example "(2+5) (9-3*2)". This can be represented in RPN as "2 5 + 9 3 2 - *". There is a well-known algorithm (the Shunting yard algorithm) that converts any traditional infix-notation into an RPN expression. This also includes function calls and any sort of constant objects (like strings, numbers etc).
How to generate RPN bytecode from a procedural scripting language?
You can easily use the aforementioned Shunting yard algorithm to convert any expression into an RPN notation. In order to get bytecode, you only have to tag each element from the RPN stream corresponding to its type (ie is it an operation or an operand). This bytecode then has to be extended to support (conditional) jumps and some special operations like variable assignments etc.
My implementation does not use the shunting yard algorithm, instead I use a traditional recursive parser which directly can produce the RPN bytecode. Actually the fact that I already had a parser for a simpler interpreter made me think about RPN code, because it turns out to be much easier to produce RPN bytecode in contrast to code for a more advanced virtual machine with registers. Plus the RPN bytecode implicitly still contains the complete abstract syntax tree, which in turn opens the door for aggressive optimisation algorithms, or even a bytecode transformer that produces native assembler code.
How to interpret the final bytecode?
The final interpretation of the RPN based bytecode is rather easy, you simply need a stack for all incoming value-type tokens, and then you can easily scan the RPN stream as follows:
- Take token from stream.
- If token is a value, push it onto stack.
- If token is an operation, pop needed operands from stack, perform operation and finally push result onto stack.
Implementation?
Actually I have just implemented a basic bytecode compiler which does not use the shunting-yard algorithm to parse a script, but has a more traditional approach. But it does produce RPN bytecode and the bytecode executor simply is a big switch-statement (with different cases corresponding to different operators or value-types). You can access the scripting engine via SVN as part of the Magnum framework at svn://subversion.dimajix.de/magnum. I will try to package a smaller download version (or a new release of Magnum) soon.
Similar systems
As a reader indirectly pointed out, my system is similar to The FORTH Programming Language in that both use RPN notation (although my scripting engine only uses RPN internally an sticks to the standard notation as its input format) and two stacks (one for data and one for return addresses - the second stack currently is not explicitly implemented in Magnum, but my bytecode intepreter works recursively, so the return addresses are stored on the real CPU stack.)


