Felix Ple┼čoianu, 2008-09-07

Make Your Own Programming Language
Part 2

This is the second in a 5-part tutorial on how to implement a programming language. It is intended for people with some programming experience, who want to know how their compiler, interpreter or virtual machine works. Hint: it's not magic.

In Part 1 you've learned to make yourself a simple programming language based around two data structures: a stack and a dictionary. Scratch — as I named it — is useful, but needs a few more things. In this installment, I will show you how to do the following:

Interestingly enough, all these features depend on exactly one modification to the interpreter.

Reading ahead

How would you go about declaring a variable in Scratch? The interpreter doesn't yet know its name, so you can't just say X VAR: it would choke on the X in the first place. So it has to be VAR X (like in Javascript... fitting), except VAR can't be a word, because it would have to read ahead in the input stream, and words can't even see the lexer. Or can they? Let's look at the run() method again.

	this.run = function (text) {
		var lexer = new ScratchLexer(text);
		var word;
		var num_val;

		while (word = lexer.nextWord()) {
			word = word.toUpperCase();
			num_val = parseFloat(word);
			if (dictionary[word]) {
				dictionary[word](this);
			} else if (!isNaN(num_val)) {
				this.stack.push(num_val);
			} else {
				throw "Unknown word";
			}
		}
	};

As you can see, my first instinct was to make lexer local, as per the principle of least privilege. But this is really hampering me now. I could make variable declarations a special case (another else if...) and go downhill from there. Or I could simply change two lines of code.

		this.lexer = new ScratchLexer(text);

		...

		while (word = this.lexer.nextWord()) {

Now I'm free to make VAR an ordinary word. Openness for the win!

function makeVariable(terp) {
	var me = {value: 0};
	return function () { terp.stack.push(me); };
}

var VariableWords = {
	// Read next word from input and make it a variable.
	"VAR": function (terp) {
		var var_name = terp.lexer.nextWord();
		if (var_name == null) throw "Unexpected end of input";

		terp.define(var_name, makeVariable(terp));
	},

};

	// This goes into the definition of Scratch()
	this.define = function (word, code) {
		dictionary[word.toUpperCase()] = code;
	}

OK, that was a little complicated: each new variable is hidden inside a word that, when executed, puts a reference to it on the stack. And once we have a reference... well, you guessed it.

var VariableWords = {
	...

	// Store value of 2OS into variable given by TOS.
	"STORE": function (terp) {
		if (terp.stack.length < 2) throw "Not enough items on stack";
		var reference = terp.stack.pop();
		var new_value = terp.stack.pop();
		reference.value = new_value;
	},
	// Replace reference to variable on TOS with its value.
	"FETCH": function (terp) {
		if (terp.stack.length < 1) throw "Not enough items on stack";
		var reference = terp.stack.pop();
		terp.push(reference.value);
	}
}

Come on, check that it works:

var a
10 a store
a fetch print

As for constants (I promised, didn't I), it should be easy now. All they need to do is push a fixed value on the stack.

function makeConstant(value, terp) {
	return function () { terp.stack.push(value); }
}

var ConstantWords = {
	// Read next word from input and make it a constant with TOS as value.
	"CONST": function (terp) {
		if (terp.stack.length < 1) throw "Not enough items on stack";
		var const_name = terp.lexer.nextWord();
		if (const_name == null) throw "Unexpected end of input";

		var const_value = terp.stack.pop();
		terp.define(const_name, makeConstant(const_value, terp));
	}
}

Except right now they aren't really constants (they can be redefined), so they could just as well serve as the variables of Scratch. But this is only because we're coding in Javascript here; other languages may not be so flexible.

Reading ahead some more

Now it should be clear how to make strings: define a word (a double quote should do nicely) that reads ahead until it meets an end-of-string marker. A second double quote is best IMO. Then we push the collected string onto the stack.

var StringWords = {
	"\"": function (terp) {
		var collector = "";
		var done = false;
		do {
			var next_word = terp.lexer.nextWord();
			if (next_word == null) throw "Unexpected end of input";
			if (next_word.substr(-1, 1) == "\"") {
				next_word = next_word.slice(0, -1);
				collector += next_word;
				done = true;
			} else {
				collector += next_word;
				collector += " ";
			}
		} while (!done);
		terp.stack.push(collector);
	}
}

Now you can finally say:

" Hello, world" print

And yes, that's a space after the first double quote. It's a word like any other, remember?

Comments are similar, except they don't need to keep the content.

var CommentWords = {
	"/*": function (terp) {
		do {
			var next_word = terp.lexer.nextWord();
			if (next_word == null) throw "Unexpected end of input";
		} while (next_word.substr(-2, 2) != "*/");
	}
};

It is interesting to note that Forth (my source of inspiration, as noted in Part 1) uses parantheses to delimit comments. Did you notice they're not needed for anything else? Forth also has a word called "*/". You can still implement it.

We're almost done here, with one little exception: what if you want to preserve whitespace inside strings? Or — shudder — implement C++ style comments that run to the end of line? As before, both problems have the same solution, and it lies in the lexer.

Reading character-by-character

Trouble is, I've written my little lexer to read word by word. That was in order to keep things as simple as possible. The bad news is, I have to rewrite it. The good news is, I don't need to touch anything else.

// Trying to avoid regular expressions here.
function isWhitespace(char) {
	return (char == " ")
		|| (char == "\t")
		|| (char == "\r")
		|| (char == "\n")
		|| (char == "\v");
}

function ScratchLexer(text) {
	var position = 0; // Beginning of TEXT.

	this.nextWord = function() {
		if (position >= text.length) return null;
		while (isWhitespace(text.charAt(position))) {
			position ++;
			if (position >= text.length) return null;
		}
		var new_pos = position;
		while (!isWhitespace(text.charAt(new_pos))) {
			new_pos ++;
			if (new_pos >= text.length) break;
		}
		var collector = text.substring(position, new_pos);
		new_pos ++; position = new_pos; // Skip the delimiter.
		return collector;
	};

	this.nextCharsUpTo = function (char) {
		if (position >= text.length) return null;
		var new_pos = position;
		while (text.charAt(new_pos) != char) {
			new_pos ++;
			if (new_pos >= text.length)
				throw "Unexpected end of input";
		}
		var collector = text.substring(position, new_pos);
		new_pos ++; position = new_pos; // Skip the delimiter.
		return collector;
	};
}

Whoa, looks like we've gained a lot of complexity all of a sudden. But just look at our string-making word now.

var StringWords = {
	"\"": function (terp) {
		terp.stack.push(terp.lexer.nextCharsUpTo("\""));
	}
};

Isn't that precious? You can even keep the old version around, with a different name; it will work just like before.

That concludes the second installment. In Part 3 you'll (finally) learn how to implement defining new words in Scratch itself.

Try Scratch
Stack content:

P.S. Hiding variables in a function, to be accessed indirectly after that is called "making a closure". It's an advanced, and very powerful programming concept.

Creative Commons License
Make Your Own Programming Language by Felix Pleşoianu is licensed under a Creative Commons Attribution-Share Alike 3.0 Unported License.