2014/10/28

Scripting, Part 9: Much To Do About Nothing


Now that our assembler is basically working, we should make it work reliably. That means making sure it does not blow up on our users just because they format or comment their code.

Yes, I decided to feel smart about the title.

Whitespace

Much like TV Tropes, whitespace will ruin your compiler's life. If you're still simply splitting the source on the ' ' character, you're running the risk of getting "\r" as a command name and "\t\t" as an argument or something.

Trying to parse that with our method is bound to result in errors, which we obviously don't want to show to the user; she's just trying to make her code more readable, after all. You could filter those cases in your error handling code, but there are just so many contextually different places you could put that tab...

Instead, we're removing all whitespace we don't want in a preprocessing step.
Whitespace characters commonly found in scripts are:
  • Tabstops (\t). These are easy to remove with a simple string.Replace(). We can't just replace them with empty strings though, since some may use them as the only separator between arguments. I recommend replacing them with spaces.
  • Newlines. These appear in three variants (\r, \n, \r\n, depending on OS, among other things), but we only really care about the character \r, since we use \n to split our string. To replace \r and \r\n in a single step, we'll need regular expressions: \r\n? matches both cases, and each match gets replaced with a single \n.
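Put together, the two replacements could look roughly like the sketch below. The class and method names are made up for this example, and the Tokenize helper with StringSplitOptions.RemoveEmptyEntries is my own assumption about how you might then split the lines; it is there because replacing tabs with spaces tends to leave runs of several spaces behind.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the assembler from the earlier posts.
public static class WhitespaceNormalizer
{
    public static string Normalize(string source)
    {
        // Tabs become spaces, so they still work as argument separators.
        string result = source.Replace('\t', ' ');
        // \r\n? matches "\r\n" as well as a lone "\r"; each becomes one "\n".
        return Regex.Replace(result, @"\r\n?", "\n");
    }

    public static string[][] Tokenize(string source)
    {
        string[] lines = Normalize(source).Split('\n');
        string[][] tokens = new string[lines.Length][];
        for (int i = 0; i < lines.Length; i++)
        {
            // RemoveEmptyEntries drops the empty strings produced by runs of spaces.
            tokens[i] = lines[i].Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
        }
        return tokens;
    }
}
```

Since every line break is kept as exactly one \n, the index into the outer array still lines up with the line number in the original script.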

Comments

Don't think we're done with regex yet! Without them, your compiler will interpret comments either as superfluous arguments or unknown keywords. Most languages support two types of comments:
  • End of line comments are led by some character sequence, // in C derivatives and, if memory serves, ' in BASIC derivatives, and continue until the end of the line. \/\/.* matches the double slash and any number of characters immediately following that are not line breaks. This time it's safe to just replace them with nothing.
  • Comment bodies or block comments are opened and closed by some character sequence, usually /* and */ respectively. If you want to use the loop counter to figure out line numbers for error messages, these are difficult to support: you can't just count the line breaks they contain and replace them with an equal amount, since they can be placed between arguments, thus creating at least one error that should not be there. Replacing them with a single line break does the same, and replacing them with a space or nothing may pull a command into the same line as another.
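    • \/\*.*?\*\/ only matches inlined comments (i.e. ones that don't contain any line breaks). The ? keeps the match from running on to the last */ in the line. Replace these with a space.
    • ^\/\*[\s\S]*?\*\/$ (with multiline matching enabled, so ^ and $ also match at line breaks) only matches blocks that start and end directly between line breaks, or the beginning or end of the string. Note that . inside square brackets is a literal dot, which is why [\s\S] stands in for "any character, including line breaks". Run this after the inline pattern, otherwise it can swallow the code between two inline comments on the same line. Replace these with the contained number of line breaks, or, if you can address line numbers in a way other than the loop counter, empty strings.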
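Here is a rough sketch of the three passes chained together. CommentStripper and Strip are names I made up for this example, and it assumes the line endings were already normalized to \n and that comment markers never appear inside other comments (a // inside a block comment would trip it up).

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper; assumes line breaks are already normalized to "\n".
public static class CommentStripper
{
    public static string Strip(string source)
    {
        // 1. End of line comments: "//" and everything up to the line break.
        string result = Regex.Replace(source, @"\/\/.*", "");

        // 2. Inline block comments (opened and closed on one line) become a space.
        result = Regex.Replace(result, @"\/\*.*?\*\/", " ");

        // 3. Blocks that start and end directly between line breaks are replaced
        //    with as many line breaks as they contained, so line numbers stay put.
        result = Regex.Replace(
            result,
            @"^\/\*[\s\S]*?\*\/$",
            m => new string('\n', m.Value.Split('\n').Length - 1),
            RegexOptions.Multiline);

        return result;
    }
}
```

Block comments that start in the middle of a line and end on a later one are not matched by any of the three patterns. That is the difficult case described above, and this sketch simply leaves them in place.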

Directives

Now we're done with syntax that doesn't compile, but anyone who's ever used #if knows this is not all a preprocessor does. There are usually some commands that don't compile, too.
The way we're implementing them, they are not handled during preprocessing, though (at least not entirely), so we'll just call them preprocessor directives.
They all function differently, so we're just defining the basic interface today: some may need a few passes over the entire code first, some may just execute once, some may need to modify the bytecode after the fact. The two we're implementing tomorrow are both used to resolve non-numeric arguments, so we'll include two functions for that as well.

For me, the interface looks something like this:
 using System.IO; // BinaryWriter

 public abstract class Directive
 {
   public Directive(string name, int paramCount)
   {
     this.Name = name;
     this.Parameters = paramCount;
   }

   // Lookup: the name the directive is found by, and how many parameters it takes
   public string Name { get; private set; }
   public int Parameters { get; private set; }

   // Functionality: optional hooks before, during and after compilation
   public virtual void Preprocess(string[][] source) { }
   public virtual void Execute(string[] args, BinaryWriter bytecode) { }
   public virtual void Postprocess(BinaryWriter bytecode) { }

   // Resolving non-numeric arguments
   public virtual bool CanCompileValue(string val) { return false; }
   public virtual void CompileValue(string val, int argLength, BinaryWriter bytecode) { }
 }
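To give a feel for how a subclass might plug into this, here is a small sketch of a made-up directive that resolves character literals such as 'A' to their character code. CharLiteralDirective and the name "#char" are purely hypothetical and not one of the directives from this series. I'm also assuming that argLength is the size of the argument in bytes and that values are written little-endian; neither is settled by the interface itself. The base class is repeated so the sketch compiles on its own.

```csharp
using System;
using System.IO;

// Repeated from above so this sketch is self-contained.
public abstract class Directive
{
    public Directive(string name, int paramCount) { Name = name; Parameters = paramCount; }
    public string Name { get; private set; }
    public int Parameters { get; private set; }
    public virtual void Preprocess(string[][] source) { }
    public virtual void Execute(string[] args, BinaryWriter bytecode) { }
    public virtual void Postprocess(BinaryWriter bytecode) { }
    public virtual bool CanCompileValue(string val) { return false; }
    public virtual void CompileValue(string val, int argLength, BinaryWriter bytecode) { }
}

// Hypothetical example directive: turns arguments like 'A' into a number.
public class CharLiteralDirective : Directive
{
    public CharLiteralDirective() : base("#char", 0) { }

    // Only claims values that are exactly one character wrapped in single quotes.
    public override bool CanCompileValue(string val)
    {
        return val.Length == 3 && val[0] == '\'' && val[2] == '\'';
    }

    public override void CompileValue(string val, int argLength, BinaryWriter bytecode)
    {
        int code = val[1];
        // Assumption: argLength is the argument size in bytes, low byte first.
        for (int i = 0; i < argLength; i++)
        {
            // A char has 16 bits, so every byte past the second one is zero.
            byte b = i < 2 ? (byte)((code >> (8 * i)) & 0xFF) : (byte)0;
            bytecode.Write(b);
        }
    }
}
```

The assembler would then ask each registered directive CanCompileValue whenever it meets an argument it can't parse as a number, and hand the argument to the first one that says yes.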

This post's a little shorter, and maybe more of a breather to some. Tomorrow, we'll really get cracking with two directives and the logic behind them (especially the former): #define and #label.
