2014/10/30

Scripting, Part 10: Directives - Defining and Resolving Symbols

Last time, we made our compiler a little more user friendly, but today, we're adding a feature that even basic compilers should have: symbols.

Generally speaking, there are two types of symbols that should interest us: constants and labels. We'll implement support for both types by adding two concrete directives, which we already defined the interface for in the last post.

Again, if you are looking for my source code, that can be found here.

Labels

are slightly easier to implement, but tremendously valuable for any language that supports control flow. For those that never worked with assembly language or batch scripts, labels are markers in your source code that identify locations to jump to.

Our compiler thus needs to support two things: parsing a label definition correctly and parsing references to them.
The former is pretty straightforward, so let's jump into the definition of our directive class:
 public class DefineLabelDirective : Directive
 {  
   public DefineLabelDirective() : base("lbl", 1) {    }  
     
   Dictionary<string, long> labelAddresses = new Dictionary<string, long>();
     
   public override void Execute(string[] args, BinaryWriter bytecode)
   {
     if (!labelAddresses.ContainsKey(args[0]))
       labelAddresses.Add(args[0], bytecode.BaseStream.Position);
   }
 }  
Note that this is abbreviated in that there is no error checking or messages.

Now our compiler notes labels as they are defined without much trouble. However, we cannot guarantee that our compiler knows every label used, since we have to pass possible calls before all labels are marked in code.
We could make this easier by marking labels during preprocessing, calculating their addresses by adding up the length of each command prior to the definition. But then you're effectively parsing the script twice, which is unnecessary.

Instead, we compile label references as a default value, save their targets and location, and finally write the correct values in a postprocessing step:
 Dictionary<long, string> labelReferences = new Dictionary<long, string>();  
   
 public override bool CanCompileValue(string val)  
 {  
   return val.StartsWith("@");  
 }  
 public override void CompileValue(string val, int argLength, BinaryWriter bytecode)  
 {  
   labelReferenes.Add(bytecode.BaseStream.Position, val);
   bytecode.Write(0L);
 }  
   
 public override void Postprocess(BinaryWriter bytecode)  
 {  
   foreach (long lbl in labelReferences.Keys)  
   {  
     bytecode.BaseStream.Position = lbl;  
       
     if (labelAddresses.ContainsKey(labelReferences[lbl]))  
       bytecode.Write(labelAddresses[labelReferences[lbl]]);  
   }  
 }  
Again, no error checking or messaging that really ought to be there (e.g. you should check if the value you're compiling actually should be a long).

Constants

are more of a usability thing, but a really great one you should not have users miss out on. There are two kinds of constants you could implement:
One can be done entirely in preprocessing, serving as a string replace.
The other one can be done either entirely during compilation or with steps taken during both preprocessing and compilation.
Since the first option is pretty trivial to implement in the Preprocess method, let's look at the latter.

During preprocessing, we want to find all definitions, make sure there are no cycles and that we can compile them all to numbers. Executing a definition is not needed, but we need to implement both the CanCompileValue and CompileValue methods.
 public class DefineConstantDirective : Directive  
 {  
   public DefineConstantDirective() : base("def", 2)  
   {  }  
     
   Dictionary<string, string> values = new Dictionary<string, string>();  
   Dictionary<string, bool> resolved = new Dictionary<string, bool>();  
     
   public override void Preprocess(string[][] source)  
   {  
     // Find definitions  
     for (int l = 0; l < source.Count; l++)  
     {  
       if (source[l].Length != this.ParameterCount + 1)  
         continue;  
   
       if (source[l][0] == "def")  
       {  
         values.Add(source[l][1], source[l][2]);  
         resolved.Add(source[l][1], false);  
       }  
     }  
       
     // TODO: Resolve definitions  
   }  
     
   public override bool CanCompileValue(string val)  
   {  
     return values.ContainsKey(val) && resolved[val];  
   }  
   public override void CompileValue(string val, int argLength, BinaryWriter bytecode)  
   {  
     int v = Convert.ToString(values[val]);  
     for (int i = 0; i < argLength; i++)  
     {  
       bytecode.Write((byte)v);  
       v >>= 8;  
     }  
   }  
 }  

Now, resolving constants is the fun bit. As you can see, I already mapped the symbols to a flag to mark certain values as resolved.
We'll be using a recursive function to resolve symbols. Basically, a symbol can be resolved if its value can be interpreted as a number or its value is itself a symbol that can be resolved. Problem is, if we do it like that, our compiler will horribly crash when symbols are defined in a cycle, so we'll mark symbols that we tried to resolve as resolved prematurely, determine if they can be resolved, and if not, flip the flag back:
 private void Resolve(string symbol, Dictionary<string, string> values, Dictionary<string, bool> resolved)  
 {  
   resolved[symbol] = true;  
   
   if (Regex.IsMatch(values[symbol], @"^\d+$"))  
     return;  
   else if (values.ContainsKey(values[symbol]))  
   {  
     // Try to resolve referenced symbol first  
     if (!resolved[values[symbol]])  
       Resolve(values[symbol], values, resolved);  
   
     if (resolved[values[symbol]] && Regex.IsMatch(values[values[symbol]], @"^\d+$"))  
       values[symbol] = values[values[symbol]];  
     else  
       resolved[symbol] = false;  
   }  
   else resolved[symbol] = false;  
 }  

I know this snippet almost looks like LISP, but it works and should not be too hard on anyone who's ever worked with recursion before. The regex @"^\d+$" matches decimal numbers only, by the way.

All that's left for you to do is to run this procedure for every symbol after scanning for definitions, and hooking up the directives with the compiler. For those of you having trouble figuring that out, I've done this in my own code, linked at the top of this post.

Other than that, we're basically done here: Over the last 10 posts, we've built a solid, reusable interpreter, a language for it to interpret, and a basic tool chain to support it. I've produced this series in a little over a week, and now you should be able to get a prototype working in the same time!

I certainly had fun researching and writing this, I hope some of you found some use in it. If you did, please leave some feedback here or on reddit, and join me in a week for some more, though I don't yet know more of what.

See ya!

No comments:

Post a Comment