2014/10/30

Scripting, Part 8: From Source to Bytecode

In the last post, I vaguely described how my compiler, which really is an assembler, works. Let's get our hands dirty by starting small. We'll discuss how to store command signatures as data and compile instructions accordingly.

What a command is made of

Looking back at our example languages, each line of a script was basically just a keyword followed by zero or more arguments of various types. When compiled, a command consists of an identifier and the same arguments (in binary, rather than in text).
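To make that concrete before we write any real code, here's a tiny example. The Wait command, its ID and its argument width are made up for illustration, not part of the language we designed:

```csharp
// Hypothetical example: a "Wait" command with ID 0x01 and one
// two-byte argument. "Wait 500" compiles to the ID byte followed by
// the argument in binary: 500 = 0x01F4, stored little-endian.
byte id = 0x01;
int argument = 500;
byte[] compiled = { id, (byte)argument, (byte)(argument >> 8) };
// compiled now holds { 0x01, 0xF4, 0x01 }
```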

So, to compile a command, we need to find a signature with the matching keyword, which also tells us how many arguments to expect and how to compile them. In many cases it's enough to represent every type with an integer indicating how many bytes it should use; in other cases, an enumeration or even a class will be in order.

For our purposes, it's enough to define command signatures like this:
 public class CommandSignature  
 {  
   public CommandSignature(string name, byte id, params byte[] args)  
   {  
     this.Name = name;  
     this.ID = id;  
     this.ParameterTypes = (args == null) ? new byte[0] : args;  
   }  
   
   public string Name { get; private set; }  
   public byte ID { get; private set; }  
   public byte[] ParameterTypes { get; private set; }  
 }  

And that's basically all you need for a first abstraction from bytecode to text.
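For illustration, a small command table for the behaviour tree language from the earlier posts might look like this. Treat the exact IDs and argument widths as an assumption; the class is repeated in condensed form so the snippet stands alone:

```csharp
using System.Collections.Generic;

// A small, illustrative command table. The IDs loosely follow the
// instruction set from Part 6; each parameter here is four bytes wide.
List<CommandSignature> commands = new List<CommandSignature>
{
    new CommandSignature("End",      0x00),
    new CommandSignature("Action",   0x01, 4),  // one 4-byte action ID
    new CommandSignature("Inverter", 0x03),
    new CommandSignature("Sequence", 0x07, 4),  // 4-byte child count
};

// Condensed copy of the CommandSignature class from above:
class CommandSignature
{
    public CommandSignature(string name, byte id, params byte[] args)
    { Name = name; ID = id; ParameterTypes = args ?? new byte[0]; }
    public string Name { get; }
    public byte ID { get; }
    public byte[] ParameterTypes { get; }
}
```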

Writing Bytecode

Now we're going to write multiple datatypes into a byte buffer again, so a BinaryWriter it is. But we're running into a problem similar to what we had to deal with for the interpreter: we don't want to write to a file immediately. And again, the solution is a MemoryStream.
However, we don't know the size of the buffer yet! We could separate parsing commands from writing them and add a length field to our command signature, but .NET has us covered: if we don't pass a buffer to the MemoryStream, it creates a growable buffer by itself, allowing us to write arbitrary amounts of data and retrieve the produced bytes later:
 using (BinaryWriter bytecode = new BinaryWriter(new MemoryStream()))  
 {  
   // TODO: Parse + assemble commands  
   return (bytecode.BaseStream as MemoryStream).ToArray();  
 }  

Before we write anything, though, we should make the source code a little more machine friendly. Assuming you allow exactly one command per line, you can do this pretty easily, by splitting it into a jagged array of keywords and their arguments:
 string[] lines = source.Split('\n');  
 string[][] parts = new string[lines.Length][];  
 for (int i = 0; i < lines.Length; i++)  
   parts[i] = lines[i].Split(new char[] { ' ' },
     StringSplitOptions.RemoveEmptyEntries);

With that, we can finally get to the important bits: The first thing we need to compile each command is the right command signature. We could use a dictionary to map keywords to commands, but that won't work in case you want to overload a command (e.g. like ADD r0, #0xFF and ADD r0, r0, r1 in actual assembly). Instead, we retrieve the first command that has the right name and number of parameters:
 List<CommandSignature> commands;  // Initialized elsewhere  
   
 for (int l = 0; l < parts.Length; l++)  
 {  
   // List<T>.Contains takes a value, not a predicate, so we use LINQ's
   // FirstOrDefault (using System.Linq) to test and fetch in one pass:
   CommandSignature cmd = commands.FirstOrDefault(sig =>
     (sig.Name == parts[l][0]) &&
     (sig.ParameterTypes.Length == parts[l].Length - 1));
   if (cmd != null)
   {  
     bytecode.Write(cmd.ID);  
       
     // TODO: Write arguments  
   }  
   else  
     // TODO: Print error  
 }  

I'll leave the specifics of error handling to you. When implementing it, just keep in mind what shitty compiler error messages do to you, and make yours as fine-grained as you can.
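For example, distinguishing "unknown keyword" from "wrong argument count" is already a big step up from one generic error. A sketch; the method name and message wording are my own:

```csharp
using System;
using System.Collections.Generic;

// Sketch of fine-grained error reporting. knownNames would hold the
// keywords from the command table.
static string DescribeError(string[] line, int lineNumber, HashSet<string> knownNames)
{
    if (!knownNames.Contains(line[0]))
        return String.Format("Line {0}: unknown command '{1}'.", lineNumber, line[0]);
    return String.Format("Line {0}: '{1}' does not take {2} argument(s).",
        lineNumber, line[0], line.Length - 1);
}
```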

Writing arguments

is a fun little part of the assembler. Right now, our command signatures just tell us how many bytes each argument should take. Naturally, you could use a switch statement or another array of delegates, but a more direct way to do it is to do the conversion to bytes manually.

To do that, we can take advantage of the fact that BinaryReader and BinaryWriter use little endian byte order, i.e. the least significant byte of a four byte integer comes first (e.g. 0x00ABCDEF is stored as [EF][CD][AB][00]).
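You can verify that byte order quickly:

```csharp
using System;
using System.IO;

// BinaryWriter stores multi-byte integers least significant byte
// first, regardless of platform.
MemoryStream ms = new MemoryStream();
new BinaryWriter(ms).Write(0x00ABCDEF);
byte[] bytes = ms.ToArray();
// bytes is { 0xEF, 0xCD, 0xAB, 0x00 }
```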

Since the bytes with lesser significance come first, the trick is to cast the value to a byte (keeping only its lowest byte), write it, and right-shift the integer value by 8 bits, once per byte:
 for (int p = 0; p < cmd.ParameterTypes.Length; p++)  
 {  
   int val = Convert.ToInt32(parts[l][p + 1]);  
     
   for (int b = 0; b < cmd.ParameterTypes[p]; b++)  
   {  
     bytecode.Write((byte)val);  
     val = val >> 8;  
   }  
 }  
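For completeness, the interpreter side reverses this encoding by shifting the bytes back into place. This helper and its name are my own, just to show the decode direction for arbitrary widths; the series' interpreter reads fixed-width values via BinaryReader:

```csharp
// Decode a little-endian value of the given width from a buffer,
// mirroring the shift-and-write loop above.
static int ReadLittleEndian(byte[] buffer, int offset, int width)
{
    int val = 0;
    for (int b = width - 1; b >= 0; b--)
        val = (val << 8) | buffer[offset + b];
    return val;
}
```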

And that concludes another part earlier than I'd wish. In the next post, we'll make the compiler a little more usable by parsing around whitespace and comments, and hopefully build the foundation for the final post on symbols like constants and labels.

Scripting, Part 10: Directives - Defining and Resolving Symbols

Last time, we made our compiler a little more user friendly, but today, we're adding a feature that even basic compilers should have: symbols.

Generally speaking, there are two types of symbols that should interest us: constants and labels. We'll implement support for both types by adding two concrete directives, which we already defined the interface for in the last post.

Again, if you are looking for my source code, that can be found here.

Labels

are slightly easier to implement, but tremendously valuable for any language that supports control flow. For those who have never worked with assembly language or batch scripts: labels are markers in your source code that identify locations to jump to.

Our compiler thus needs to support two things: parsing a label definition correctly and parsing references to them.
The former is pretty straightforward, so let's jump into the definition of our directive class:
 public class DefineLabelDirective : Directive
 {  
   public DefineLabelDirective() : base("lbl", 1) {    }  
     
   Dictionary<string, long> labelAddresses = new Dictionary<string, long>();
     
   public override void Execute(string[] args, BinaryWriter bytecode)
   {
     if (!labelAddresses.ContainsKey(args[0]))
       labelAddresses.Add(args[0], bytecode.BaseStream.Position);
   }
 }  
Note that this is abbreviated in that there is no error checking or messages.

Now our compiler notes labels as they are defined without much trouble. However, we cannot guarantee that our compiler knows every label that is used, since it may encounter a reference to a label before it reaches that label's definition.
We could make this easier by marking labels during preprocessing, calculating their addresses by adding up the length of each command prior to the definition. But then you're effectively parsing the script twice, which is unnecessary.

Instead, we compile label references as a default value, save their targets and location, and finally write the correct values in a postprocessing step:
 Dictionary<long, string> labelReferences = new Dictionary<long, string>();  
   
 public override bool CanCompileValue(string val)  
 {  
   return val.StartsWith("@");  
 }  
 public override void CompileValue(string val, int argLength, BinaryWriter bytecode)  
 {  
   labelReferences.Add(bytecode.BaseStream.Position, val);
   bytecode.Write(0L);
 }  
   
 public override void Postprocess(BinaryWriter bytecode)  
 {  
   foreach (long lbl in labelReferences.Keys)  
   {  
     bytecode.BaseStream.Position = lbl;  
       
     if (labelAddresses.ContainsKey(labelReferences[lbl]))  
       bytecode.Write(labelAddresses[labelReferences[lbl]]);  
   }  
 }  
Again, no error checking or messaging that really ought to be there (e.g. you should check if the value you're compiling actually should be a long).

Constants

are more of a usability thing, but a really great one you should not have users miss out on. There are two kinds of constants you could implement:
One can be done entirely in preprocessing, serving as a string replace.
The other one can be done either entirely during compilation or with steps taken during both preprocessing and compilation.
Since the first option is pretty trivial to implement in the Preprocess method, let's look at the latter.
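For reference, the purely textual first variant could look like this. A sketch with names of my own; it assumes the definitions have already been collected into a dictionary:

```csharp
using System.Collections.Generic;

// Constants as plain text substitution: replace any argument that
// matches a defined constant name with its value. Position 0 is the
// keyword, so we skip it.
static void ReplaceConstants(string[][] source, Dictionary<string, string> constants)
{
    for (int l = 0; l < source.Length; l++)
        for (int p = 1; p < source[l].Length; p++)
            if (constants.ContainsKey(source[l][p]))
                source[l][p] = constants[source[l][p]];
}
```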

During preprocessing, we want to find all definitions, make sure there are no cycles and that we can compile them all to numbers. Executing a definition is not needed, but we need to implement both the CanCompileValue and CompileValue methods.
 public class DefineConstantDirective : Directive  
 {  
   public DefineConstantDirective() : base("def", 2)  
   {  }  
     
   Dictionary<string, string> values = new Dictionary<string, string>();  
   Dictionary<string, bool> resolved = new Dictionary<string, bool>();  
     
   public override void Preprocess(string[][] source)  
   {  
     // Find definitions  
     for (int l = 0; l < source.Length; l++)  
     {  
       if (source[l].Length != this.ParameterCount + 1)  
         continue;  
   
       if (source[l][0] == "def")  
       {  
         values.Add(source[l][1], source[l][2]);  
         resolved.Add(source[l][1], false);  
       }  
     }  
       
     // TODO: Resolve definitions  
   }  
     
   public override bool CanCompileValue(string val)  
   {  
     return values.ContainsKey(val) && resolved[val];  
   }  
   public override void CompileValue(string val, int argLength, BinaryWriter bytecode)  
   {  
     int v = Convert.ToInt32(values[val]);  
     for (int i = 0; i < argLength; i++)  
     {  
       bytecode.Write((byte)v);  
       v >>= 8;  
     }  
   }  
 }  

Now, resolving constants is the fun bit. As you can see, I already mapped the symbols to a flag to mark certain values as resolved.
We'll be using a recursive function to resolve symbols. Basically, a symbol can be resolved if its value can be interpreted as a number or its value is itself a symbol that can be resolved. Problem is, if we do it like that, our compiler will horribly crash when symbols are defined in a cycle, so we'll mark symbols that we tried to resolve as resolved prematurely, determine if they can be resolved, and if not, flip the flag back:
 private void Resolve(string symbol, Dictionary<string, string> values, Dictionary<string, bool> resolved)  
 {  
   resolved[symbol] = true;  
   
   if (Regex.IsMatch(values[symbol], @"^\d+$"))  
     return;  
   else if (values.ContainsKey(values[symbol]))  
   {  
     // Try to resolve referenced symbol first  
     if (!resolved[values[symbol]])  
       Resolve(values[symbol], values, resolved);  
   
     if (resolved[values[symbol]] && Regex.IsMatch(values[values[symbol]], @"^\d+$"))  
       values[symbol] = values[values[symbol]];  
     else  
       resolved[symbol] = false;  
   }  
   else resolved[symbol] = false;  
 }  

I know this snippet almost looks like LISP, but it works and should not be too hard on anyone who's ever worked with recursion before. The regex @"^\d+$" matches decimal numbers only, by the way.

All that's left for you to do is to run this procedure for every symbol after scanning for definitions, and to hook the directives up with the compiler. For those of you having trouble figuring that out, I've done this in my own code, linked at the top of this post.

Other than that, we're basically done here: Over the last 10 posts, we've built a solid, reusable interpreter, a language for it to interpret, and a basic tool chain to support it. I've produced this series in a little over a week, and now you should be able to get a prototype working in the same time!

I certainly had fun researching and writing this, and I hope some of you found some use in it. If you did, please leave some feedback here or on reddit, and join me in a week for some more, though I don't yet know of what.

See ya!

2014/10/28

Scripting, Part 9: Much To Do About Nothing


Now that our assembler is basically working, we should make it work for actual people. That means making sure it does not blow up on our users for formatting or commenting their code.

Yes, I decided to feel smart about the title.

Whitespace

Much like TV Tropes, whitespace will ruin your compiler's life. If you're still simply splitting the source on the ' ' character, you're running the risk of getting "\r" as a command name and "\t\t" as an argument or something.

Trying to parse that with our method is bound to result in errors, which we obviously don't want to show to the user; she's just trying to make her code more readable, after all. You could filter those cases in your error handling code, but there are just so many contextually different places you could put that tab...

Instead, we're removing all whitespace we don't want in a preprocessing step.
Whitespaces commonly found in scripts are
  • Tabstops (\t). These are easy to remove with a simple string.Replace(). We can't just replace them with empty strings though, since some may use them as the only separator between arguments. I recommend replacing them with spaces.
  • Newlines. These appear in three variants (\r, \n, \r\n, depending on OS, among other things), but we only really care about the character \r, since we use \n to split our string. To handle \r and \r\n in a single step, we'll need regular expressions: replacing every match of \r\n? with \n normalizes both to a plain \n.
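Put together, the whole whitespace pass is only two lines. A sketch; the method name is my own:

```csharp
using System.Text.RegularExpressions;

// Normalize line endings first (\r and \r\n both become \n),
// then turn tabs into spaces.
static string NormalizeWhitespace(string source)
{
    return Regex.Replace(source, "\r\n?", "\n").Replace('\t', ' ');
}
```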

Comments

Don't think we're done with regex yet! Without them, your compiler will interpret comments either as superfluous arguments or unknown keywords. Most languages support two types of comments:
  • End of line comments are led by some character sequence, // in C derivatives and, if memory serves, ' in BASIC derivatives, and continue until the end of the line. \/\/.* matches the double slash and any number of immediately following characters that are not line breaks. This time it's safe to just replace them with nothing.
  • Comment bodies or block comments are opened and closed by some character sequence, usually /* and */ respectively. If you want to use the loop counter to figure out line numbers for error messages, these are difficult to support: you can't just count the line breaks they contain and replace them with an equal amount, since they can be placed between arguments, thus creating at least one error that should not be there. Replacing them with a single line break does the same, replacing them with a space or nothing may pull a command into the same line as another.
    • \/\*.*?\*\/ only matches inlined comments (i.e. ones that don't contain any line breaks); the non-greedy .*? keeps two comments on one line from merging into a single match. Replace these with a space.
    • ^\/\*[\s\S]*?\*\/$ (with RegexOptions.Multiline) only matches blocks that start directly after and end directly before a line break, or at the beginning or end of the string. Replace these with the contained number of line breaks, or, if you can address line numbers in a way other than the loop counter, with empty strings.
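The end-of-line and inline block comment cases then boil down to two Regex.Replace calls. A sketch with a name of my own; multi-line blocks still need the separate treatment described above:

```csharp
using System.Text.RegularExpressions;

// Remove end-of-line comments, then replace inline block comments
// with a single space. '.' doesn't match line breaks by default, so
// the second pattern only catches comments contained in one line.
static string StripComments(string source)
{
    source = Regex.Replace(source, @"//.*", "");
    return Regex.Replace(source, @"/\*.*?\*/", " ");
}
```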

Directives

Now we're done with syntax that doesn't compile, but anyone who's ever used #if knows this is not all a preprocessor does. There usually are some commands that don't compile, too.
The way we're implementing them, they are not handled during preprocessing, though (at least not entirely), so we'll just call them preprocessor directives.
They all function differently, so we're just defining the basic interface today; some may need a first few passes over the entire code, some may just execute once, some may need to modify the bytecode after the fact. The two we're implementing tomorrow are both used to resolve non-numeric arguments, so we'll include two functions for that as well.

For me, the interface looks something like this:
 public abstract class Directive  
 {  
   public Directive(string name, int paramCount)  
   {  
     this.Name = name;  
     this.ParameterCount = paramCount;  
   }  
     
   // Lookup:  
   public string Name { get; private set; }  
   public int ParameterCount { get; private set; }  
     
   // Functionality  
   public virtual void Preprocess(string[][] source) { }  
   public virtual void Execute(string[] args, BinaryWriter bytecode) { }  
   public virtual void Postprocess(BinaryWriter bytecode) { }  
     
   public virtual bool CanCompileValue(string val) { return false; }  
   public virtual void CompileValue(string val, int argLength, BinaryWriter bytecode) { }  
 }  

This post's a little shorter, and maybe more of a breather to some. Tomorrow, we'll really get cracking with two directives and the logic behind them (especially the former): #define and #label.

2014/10/26

Scripting, Part 7: The Perks of Building a Compiler

Now that we have designed a language and built a full interpreter around it, you might think that it's time to build a compiler for it. Instead, we'll discuss why this may be a bad idea and what alternatives there are.

The inner workings

To get a grip on what's bad about coding a compiler, one should know what actually needs to be coded. The front end of a modern high-level compiler consists of two stages: scanning and parsing.
Scanning refers to grouping the source into tokens and identifying their type, and parsing refers to analysing the tokens' syntactical structure. This is some complex stuff, and both steps can be terrifying to implement - there's a reason you can major in compiler construction, and I'm not going to pretend that I can cover that with a blog post or three.

What I implemented is technically an assembler, i.e. instructions can be compiled independently, since there is no overarching syntax; each instruction in code equals exactly one instruction in byte code.
So far, my assembler consists of these basic steps:
  • Format the code for easier parsing (remove unneeded whitespace, comments),
  • cut it into instructions, which are then cut into key word + arguments,
  • lookup command definitions by key word,
  • compile the instruction according to definition.
I also added support for macros by adding a fallback to the command lookup: When no command definition is found, see if there's a macro defined, and recursively call the compile function on the macro code. By using format strings, you can easily pass parameters to the macros, too.
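The parameter passing via format strings looks roughly like this. A sketch; the macro body uses {0}, {1}, ... as placeholders, and the names here are my own:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Expand a macro invocation: line[0] is the macro name, the rest are
// arguments substituted into the stored format string. The result is
// source text that gets fed back into the compile function.
static string ExpandMacro(Dictionary<string, string> macros, string[] line)
{
    object[] args = line.Skip(1).Cast<object>().ToArray();
    return String.Format(macros[line[0]], args);
}
```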

Finally, there are several features you may want to add, depending on the language. For control flow, you'll need labels to jump to; in most languages, symbols (i.e. constants) will be useful, etc.

Assembly language

Coding an assembler may be easier, but it also forces users to write in an equivalent of assembly. This has the opposite effect of what scripting should be - you want to code on a higher level, not a lower one.
Sure, our instructions are based around higher level concepts, but in the end, you're still compiling code like this:
 Action 1  
 Action 2  
 Action 3  
 Action 4  
 Succeeder  
 Sequence 3  
 Action 5  
 Selector 3  
 End  
And that's for describing a tree!

To be fair, some use cases are more suitable for this type of language. This is a conceivable NPC script in an arbitrary language:
 facePlayer  
 showMsg "Hello World! Do you like assembly?"  
 showYesNoBox  
 if YES goto .yesAnswer  
 showMsg "Too bad..."  
 end  
   
 .yesAnswer:  
 showMsg "Me too!"  
 end  
And don't run away just because I'm using goto; it really isn't as bad as people make it out to be, especially in short scripts like this.

Alternatives to code

The point still stands, though: designing a language and building a compiler for it is not the optimal path in a lot of cases. But you need to produce the code somehow, right? Well, I can think of three alternatives:

Dedicated editors or parsers
Suppose we're still on the case of our behaviour tree markup language. Instead of people coding tree creation, it's much simpler for everyone (except maybe the tools programmer) to build a dedicated tool for visual construction of behaviour trees.
Even when a dedicated tool is not an option, parsing the XML or JSON output of existing tools is probably no more complicated than building an assembler, and the emerging tool chain is much more user friendly for everyone involved.
Assuming the tree structure already exists in memory, generating the above code is dead simple and can be done with a single recursive method.
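That single recursive method could look like this. A sketch; the minimal node representation and names are my own stand-ins, not the behaviour tree library's API. Children are emitted before their parent, which gives exactly the postfix order our language expects:

```csharp
using System;
using System.Text;

// A node is modelled minimally as an object[]: element 0 is the
// keyword, the rest are child nodes.
static void Emit(object[] node, StringBuilder code)
{
    for (int i = 1; i < node.Length; i++)
        Emit((object[])node[i], code);    // children first...
    code.Append((string)node[0]);         // ...then the node itself
    if (node.Length - 1 > 1)              // composites carry a child count
        code.Append(' ').Append(node.Length - 1);
    code.Append('\n');
}

StringBuilder code = new StringBuilder();
object[] tree =
{
    "Sequence",
    new object[] { "Action 1" },
    new object[] { "Action 2" },
    new object[] { "Action 3" },
};
Emit(tree, code);
// code now reads: "Action 1\nAction 2\nAction 3\nSequence 3\n"
```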

Dedicated GUI script editor
The easiest, if most code-heavy, option is probably a button-based script editor similar to the one used by RPG Maker:

[Image: RPG Maker's list-based event editor, where commands are inserted via buttons and dialogs]

This lends itself really well to the MVC pattern, too. When you're operating on a list of command objects, created by the respective buttons / dialogs, compiling becomes a matter of looping through that list, syntax highlighting can be done on object creation, and syntax errors become virtually impossible.
If you're not comfortable in this kind of environment, imagine the same principle in a console. Now translate the console to a single text box, while the actual code appears in the same kind of list as shown above. Add some functionality to the up and down arrow keys and some shortcuts and you're good to go.

Flow charts
If none of that is for you, see if you can get your hands on a good tool for flow charts and its output format. Parse and compile it right and you essentially get Unreal's Blueprints. Scripting with flow charts is a very visual, intuitive programming method and allows most people to get the gist of it all quite quickly.
This is also the most difficult to get working, and working right, though, and I haven't really done anything like it myself, so I will refrain from further commenting on it.

And that's it for now! If you still want to get a compiler going, the next three posts will deal with just that, and that will be the conclusion of this series. If not, I guess this journey ends here for you. In that case, I hope I had something useful to say to you, and wish you all happy coding, or whatever programmers wish each other.

2014/10/24

Scripting, Part 6: Behaviour Tree Builder - Interpreter

In the last post, I designed a simple markup language for behaviour trees, outlining all available commands. In this post, we're implementing an interpreter for that language.

Behaviour trees in code

Behaviour trees are the kind of thing lots of libraries are programmed for, but I'd rather not make this post implementation-dependent, so I'll assume a simple, generic API: All types of nodes are subclasses of BehaviourTreeNode, and their constructors either take an integer as an argument (leaf nodes), a single BehaviourTreeNode, or an IEnumerable<BehaviourTreeNode> collection, depending on the type of node.

The respective classes are Action, RunBehaviour, Sequence, Selector, Inverter, Succeeder, Repeater and RepeatUntilFail.

I'm not making any other assumptions about how they are all implemented, but I don't have to, since a script is only supposed to create a behaviour tree, not use one.

The interpreter

To build the interpreter, we subclass the BaseInterpreter class we developed in the previous posts (as a reminder, the source code can be found here). To do that, we need to implement the abstract method InitInsn, which initializes the instruction set, i.e. the two arrays of delegates. We can't actually do that yet - we haven't written the methods that will make up the instruction set - so let's see what else needs to be done.
As explained in the previous post, we need to add a stack of BehaviourTreeNode to keep track of previously defined nodes. For convenience, we'll also add a method Run that executes the entire byte code passed to it and returns the generated behaviour tree:
 Stack<BehaviourTreeNode> defined = new Stack<BehaviourTreeNode>();  
 bool running = false;  
 
 public BehaviourTreeNode Run(byte[] code)  
 {  
   InitScript();  
   running = true;  
   while (running)  
     NextCommand();  
   return defined.Pop();  
 }

Now we need a way to break out of this loop. Let's implement the End command first:
 private void Cmd00_End()  
 {  
   running = false;
 }  
   
 private string Cmd00_Decompile()  
 {  
   return "End";
 }

With that, we can get to the more interesting commands. Let's start with the Action command:
 private void Cmd01_Action()  
 {  
   defined.Push(new Action(DataBus.ReadInt32()));
 }  
   
 private string Cmd01_Decompile()  
 {  
   int id = DataBus.ReadInt32();
   return String.Format("Action {0}", id);
 }

The RunBehaviour command, of course, is analogous to that. To create the internal nodes, we need to pop values off the stack:
 private void Cmd03_Inverter()  
 {  
   defined.Push(new Inverter(defined.Pop()));
 }
 
 // ...
 
 private void Cmd07_Sequence()
 {
   BehaviourTreeNode[] children = new BehaviourTreeNode[DataBus.ReadInt32()];
    for (int i = children.Length - 1; i >= 0; i--)
     children[i] = defined.Pop();
   defined.Push(new Sequence(children));
 }
Note that, in the second method, the nodes are added to the temporary array in reverse order. This is necessary because, when a tree is converted to postfix notation (like our script code), nodes appear left-to-right, but when popped off the stack appear right-to-left.

With all commands implemented, we can now initialize the instruction set:
 public void InitInsn()  
 {  
   Execute = new System.Action[] { // qualified: Action also names our leaf node class
     Cmd00_End, Cmd01_Action, Cmd02_RunBehaviour,
     Cmd03_Inverter, Cmd04_Succeeder, Cmd05_Repeater,
     Cmd06_RepeatUntilFail, Cmd07_Sequence, Cmd08_Selector };
   Decompile = new Func<string>[] {
     Cmd00_Decompile, Cmd01_Decompile, Cmd02_Decompile,
     Cmd03_Decompile, Cmd04_Decompile, Cmd05_Decompile,
     Cmd06_Decompile, Cmd07_Decompile, Cmd08_Decompile };
 }  
And that's it! As you can see, I left out some of the necessary methods, but they just mirror the ones I explained here. If you want to see all the code for this part, see this paste.

I hope having an example of a concrete interpreter at hand makes the whole process of designing and implementing a language more accessible. In the next post, I'll finally talk about compilers - how they work, when you should or should not use one, and what alternatives there are.