2014/10/30

Scripting, Part 8: From Source to Bytecode

In the last post, I vaguely described how my compiler, which is really an assembler, works. Let's get our hands dirty by starting small. We'll discuss how to store command signatures as data and compile instructions accordingly.

What a command is made of

Looking back at our example languages, each line of a script was basically just a keyword followed by zero or more arguments of various types. When compiled, a command consists of an identifier and the same arguments (in binary, rather than in text).

So, to compile a command, we need to find a signature with the appropriate keyword that also tells us how many arguments to expect and how to compile them. In many cases it should be enough to represent every type with an integer indicating how many bytes it should use. In other cases, an enumeration or even a class will be in order.

For our purposes, it's enough to define command signatures like this:
 public class CommandSignature  
 {  
   public CommandSignature(string name, byte id, params byte[] args)  
   {  
     this.Name = name;  
     this.ID = id;  
     this.ParameterTypes = (args == null) ? new byte[0] : args;  
   }  
   
   public string Name { get; private set; }  
   public byte ID { get; private set; }  
   public byte[] ParameterTypes { get; private set; }  
 }  

And that's basically all you need for a first abstraction from bytecode to text.
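
To make this concrete, here is a minimal sketch of what a command table could look like. The instruction names, IDs and argument sizes are invented for illustration and are not the instruction set used in this series; the class is repeated only so the snippet compiles on its own.

```csharp
using System.Collections.Generic;

// Restated from the post so the sketch is self-contained.
public class CommandSignature
{
    public CommandSignature(string name, byte id, params byte[] args)
    {
        Name = name;
        ID = id;
        ParameterTypes = args ?? new byte[0];
    }

    public string Name { get; private set; }
    public byte ID { get; private set; }
    public byte[] ParameterTypes { get; private set; }
}

public static class ExampleCommands
{
    // Hypothetical instruction set: every name, ID and size here is made up.
    public static List<CommandSignature> Create()
    {
        return new List<CommandSignature>
        {
            new CommandSignature("NOP", 0x00),          // no arguments
            new CommandSignature("PUSH", 0x01, 4),      // one 4-byte value
            new CommandSignature("ADD", 0x02, 1, 1),    // two 1-byte registers
            new CommandSignature("ADD", 0x03, 1, 1, 1), // overload: three registers
        };
    }
}
```

Note that the two ADD entries share a keyword but differ in ID and argument count, which is what makes overloading possible later on.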

Writing Bytecode

Now we're going to write multiple data types into a byte buffer again, so a BinaryWriter it is. But we're running into a problem similar to what we had to deal with for the interpreter: we don't want to write to a file immediately. And again, the solution is a MemoryStream.
However, we don't know the size of the buffer yet! We could separate parsing commands from writing them and add a length field to our command signature, but .NET has you covered: if we don't pass a buffer to the MemoryStream, it creates a growable buffer by itself, letting you write as much as you need and retrieve the produced bytes later:
 using (BinaryWriter bytecode = new BinaryWriter(new MemoryStream()))  
 {  
   // TODO: Parse + assemble commands  
   return (bytecode.BaseStream as MemoryStream).ToArray();  
 }  
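
As a quick check of the growable buffer, the following sketch writes an arbitrary number of integers without ever stating a size. The class and method names are hypothetical and only serve the demonstration.

```csharp
using System.IO;

public static class GrowableBufferDemo
{
    // Writes `count` four-byte integers without specifying a buffer size.
    public static byte[] WriteInts(int count)
    {
        using (MemoryStream stream = new MemoryStream())
        using (BinaryWriter writer = new BinaryWriter(stream))
        {
            for (int i = 0; i < count; i++)
                writer.Write(i);       // Write(int) emits 4 bytes each

            writer.Flush();
            return stream.ToArray();   // a copy of exactly the bytes written
        }
    }
}
```

ToArray returns only the bytes that were actually written, not the stream's internal capacity, so WriteInts(1000) yields an array of 4000 bytes.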

Before we write anything, though, we should make the source code a little more machine-friendly. Assuming you allow exactly one command per line, you can do this pretty easily by splitting it into a jagged array of keywords and their arguments:
 string[] lines = source.Split('\n');  
 string[][] parts = new string[lines.Length][];  
 for (int i = 0; i < lines.Length; i++)  
   parts[i] = lines[i].Split(new char[] { ' ' },
     StringSplitOptions.RemoveEmptyEntries);
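
To see what the jagged array ends up looking like, here is the splitting step above wrapped in a hypothetical helper. Treating '\r' as a separator is my own addition, assuming the source may use Windows line endings; it is not part of the original snippet.

```csharp
using System;

public static class LineSplitter
{
    // Splits source text into one token array per line.
    public static string[][] Split(string source)
    {
        string[] lines = source.Split('\n');
        string[][] parts = new string[lines.Length][];

        for (int i = 0; i < lines.Length; i++)
        {
            // Assumption: also drop '\r' so "\r\n" line endings don't stick to the last token.
            parts[i] = lines[i].Split(new char[] { ' ', '\r' },
                StringSplitOptions.RemoveEmptyEntries);
        }

        return parts;
    }
}
```

For the input "PUSH 42\r\nADD 0 1\n" this produces three entries: { "PUSH", "42" }, { "ADD", "0", "1" } and an empty array for the text after the final newline. That empty array is worth remembering, since any code that blindly reads the first token of every line will trip over it.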

With that, we can finally get to the important bits: the first thing we need to compile each command is the right command signature. We could use a dictionary to map keywords to commands, but that won't work if you want to overload a command (e.g. ADD r0, #0xFF and ADD r0, r0, r1 in actual assembly). Instead, we retrieve the first command that has the right name and number of parameters:
 List<CommandSignature> commands;  // Initialized elsewhere  
   
 for (int l = 0; l < parts.Length; l++)  
 {  
   if (parts[l].Length == 0)  
     continue;  // Skip empty lines, they have no keyword  
   
   // FirstOrDefault (System.Linq) returns null if nothing matches  
   CommandSignature cmd = commands.FirstOrDefault(sig =>
     (sig.Name == parts[l][0]) &&
     (sig.ParameterTypes.Length == parts[l].Length - 1));
   
   if (cmd != null)  
   {  
     bytecode.Write(cmd.ID);  
       
     // TODO: Write arguments  
   }  
   else  
   {  
     // TODO: Print error  
   }  
 }  

I'll leave the specifics of errors to you. When implementing them, just keep in mind what shitty compile error messages do to you, and make yours as fine-grained as you can: at the very least, name the line and the offending keyword.
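
As one possible starting point, the sketch below tells an unknown keyword apart from a known keyword with the wrong number of arguments. Everything in it is hypothetical: the Signature class is a reduced stand-in for CommandSignature, and the message wording is just a suggestion.

```csharp
using System.Collections.Generic;
using System.Linq;

// Reduced stand-in for CommandSignature (name and argument sizes only).
public class Signature
{
    public Signature(string name, params byte[] args) { Name = name; ParameterTypes = args; }
    public string Name { get; private set; }
    public byte[] ParameterTypes { get; private set; }
}

public static class CommandLookup
{
    // Returns null if the line matches a signature,
    // otherwise a message naming the line, the keyword and the problem.
    public static string Check(List<Signature> commands, string[] tokens, int lineNumber)
    {
        List<Signature> sameName = commands.Where(sig => sig.Name == tokens[0]).ToList();
        if (sameName.Count == 0)
            return string.Format("Line {0}: unknown command '{1}'.", lineNumber, tokens[0]);

        int given = tokens.Length - 1;
        if (sameName.Any(sig => sig.ParameterTypes.Length == given))
            return null;

        // List the argument counts of all overloads so the user knows what is valid.
        string expected = string.Join(" or ",
            sameName.Select(sig => sig.ParameterTypes.Length.ToString()));
        return string.Format("Line {0}: '{1}' expects {2} argument(s), but got {3}.",
            lineNumber, tokens[0], expected, given);
    }
}
```

The caller is expected to pass a non-empty token array, so empty lines have to be filtered out beforehand.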

Writing arguments

Writing arguments is a fun little part of the assembler. Right now, our command signatures just tell us how many bytes each argument should take. Naturally, you could use a switch statement or another array of delegates, but a more direct way is to do the conversion to bytes manually.

To do that, we can take advantage of the fact that BinaryReader and BinaryWriter use little-endian byte order, i.e. the least significant byte of a multi-byte integer comes first (e.g. the four-byte integer 0x00ABCDEF is stored as [EF][CD][AB][00]).

Since the bytes with lesser significance come first, the trick is to cast the integer value to a byte, write it, and then right-shift the value by 8 bits, repeating this once for every byte the argument should take:
 for (int p = 0; p < cmd.ParameterTypes.Length; p++)  
 {  
   int val = Convert.ToInt32(parts[l][p + 1]);  
     
   for (int b = 0; b < cmd.ParameterTypes[p]; b++)  
   {  
     bytecode.Write((byte)val);  
     val = val >> 8;  
   }  
 }  
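
To convince yourself that the loop above really produces what a BinaryReader expects, you can round-trip a value. The sketch below wraps the same loop in a hypothetical helper and reads the result back; the class and method names are made up for this example.

```csharp
using System.IO;

public static class ArgumentWriter
{
    // Same shifting loop as above: writes the lowest `size` bytes of val, least significant first.
    public static void Write(BinaryWriter bytecode, int val, int size)
    {
        for (int b = 0; b < size; b++)
        {
            bytecode.Write((byte)val);  // the cast keeps only the lowest 8 bits
            val = val >> 8;
        }
    }

    public static byte[] Encode(int val, int size)
    {
        using (MemoryStream stream = new MemoryStream())
        using (BinaryWriter bytecode = new BinaryWriter(stream))
        {
            Write(bytecode, val, size);
            bytecode.Flush();
            return stream.ToArray();
        }
    }

    // Reads a four-byte value back, the way an interpreter might.
    public static int DecodeInt32(byte[] bytes)
    {
        using (BinaryReader reader = new BinaryReader(new MemoryStream(bytes)))
            return reader.ReadInt32();
    }
}
```

Encode(0x00ABCDEF, 4) yields [EF][CD][AB][00], and DecodeInt32 turns those bytes back into 0x00ABCDEF. Keep in mind that a value too large for its argument size is silently truncated (Encode(0x1234, 1) yields just [34]), so a range check with a proper error message may be worth adding.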

And that concludes another part earlier than I'd wish. In the next post, we'll make the compiler a little more usable by parsing around whitespace and comments, and hopefully build the foundation for the final post on symbols like constants and labels.
