「Go 言語でつくるインタプリタ」では、Monkey という小さな言語のインタプリタを Go で実装する過程が書かれています。
あくまで教育目的として、字句解析〜構文解析〜評価 までをシンプルな実装で丁寧に解説しています。
ここでは、Go で写経するだけなのも面白みが無いので、「Go 言語でつくるインタプリタ」の流れを Java に変換しつつ紹介してみます。
Monkey という言語は以下のような感じになります。
let x = 10; let y = 15; let add = fn(a, b) { return a + b; };
字句解析器(Lexer)
最初に Lexer を作っていきます。ソースコードの内容を字句解析し、構文解析の入力となる表現に変換するだけです。
最初に字句のタイプを表現する TokenType
を定義します。
package monkey.token; public enum TokenType { ILLEGAL("ILLEGAL"), EOF("EOF"), IDENT("IDENT"), // add, foobar, x, y, ... INT("INT"), ASSIGN("="), PLUS("+"), COMMA(","), SEMIC(";"), LPAREN("("), RPAREN(")"), LBRACE("{"), RBRACE("}"), FUNC("FUNC"), LET("LET"), ; private final String str; TokenType(String str) { this.str = str; } public String str() { return str; } }
コードから読み取った字句を表現する Token
を定義します。
package monkey.token; public class Token { public final TokenType type; public final String literal; public Token(TokenType type, String literal) { this.type = type; this.literal = literal; } }
そして、字句解析器である Lexer
です。
package monkey.lexer; import monkey.token.Token; public class Lexer { public Lexer(String input) { } public Token nextToken() { return null; } @Override public String toString() { // ... } @Override public boolean equals(Object o) { // ... } @Override public int hashCode() { // ... } }
まだ空実装ですが、入力を取り、nextToken()
で Token
オブジェクトとして取得できる、ただそれだけのものです。
初期の Lexer
最初の Lexer
は単純に、入力として =+(){},;
を受け、nextToken
により Token
のインスタンスが得られるものとします。
すなわち、テストケースは以下のようになります。
public class LexerTest { @Test public void nextToken() { String input = "=+(){},;"; Lexer lexer = new Lexer(input); assertThat(lexer.nextToken(), is(new Token(TokenType.ASSIGN, "="))); assertThat(lexer.nextToken(), is(new Token(TokenType.PLUS, "+"))); assertThat(lexer.nextToken(), is(new Token(TokenType.LPAREN, "("))); assertThat(lexer.nextToken(), is(new Token(TokenType.RPAREN, ")"))); assertThat(lexer.nextToken(), is(new Token(TokenType.LBRACE, "{"))); assertThat(lexer.nextToken(), is(new Token(TokenType.RBRACE, "}"))); assertThat(lexer.nextToken(), is(new Token(TokenType.COMMA, ","))); assertThat(lexer.nextToken(), is(new Token(TokenType.SEMIC, ";"))); } }
input
を入力とし、nextToken()
により読み取った字句を検証しているだけですね。
では、テストを通すように Lexer
を実装していきます。
package monkey.lexer; import monkey.token.Token; import monkey.token.TokenType; public class Lexer { /** input string. */ private String input; /** current position in input (points to current char). */ private int position; /** current reading position in input (after current char). */ private int readPosition; /** current char under examination. */ private char ch; public Lexer(String input) { this.input = input; this.readPosition = 0; readChar(); } public Token nextToken() { Token token; switch (ch) { case '=' : token = new Token(TokenType.ASSIGN, "="); break; case '+' : token = new Token(TokenType.PLUS, "+"); break; case '(' : token = new Token(TokenType.LPAREN, "("); break; case ')' : token = new Token(TokenType.RPAREN, ")"); break; case '{' : token = new Token(TokenType.LBRACE, "{"); break; case '}' : token = new Token(TokenType.RBRACE, "}"); break; case ',' : token = new Token(TokenType.COMMA, ","); break; case ';' : token = new Token(TokenType.SEMIC, ";"); break; case (byte) 0 : token = new Token(TokenType.EOF, ""); break; } readChar(); return token; } private void readChar() { if (readPosition >= input.length()) { ch = (byte) 0; } else { ch = input.charAt(readPosition); } position = readPosition; readPosition += 1; } }
単純に、入力を1つづつ読み、読み込んだ文字に応じて Token
をインスタンス化して返却するだけのコードになりました。
これでテストケースがグリーンとなり、初期の Lexer
ができました。
Lexer の拡張
以下のような input
文字を字句解析できるようにしていきましょう。
@Test public void nextToken() { String input = "let five = 5;" + "\n" + "let ten = 10;" + "\n" + "let add = fn(x, y) {" + "\n" + " x + y;" + "\n" + "};" + "\n" + "let result = add(five, ten);"; Lexer lexer = new Lexer(input); assertThat(lexer.nextToken(), is(new Token(TokenType.LET, "let"))); assertThat(lexer.nextToken(), is(new Token(TokenType.IDENT, "five"))); assertThat(lexer.nextToken(), is(new Token(TokenType.ASSIGN, "="))); assertThat(lexer.nextToken(), is(new Token(TokenType.INT, "5"))); assertThat(lexer.nextToken(), is(new Token(TokenType.SEMIC, ";"))); assertThat(lexer.nextToken(), is(new Token(TokenType.LET, "let"))); assertThat(lexer.nextToken(), is(new Token(TokenType.IDENT, "ten"))); assertThat(lexer.nextToken(), is(new Token(TokenType.ASSIGN, "="))); assertThat(lexer.nextToken(), is(new Token(TokenType.INT, "10"))); assertThat(lexer.nextToken(), is(new Token(TokenType.SEMIC, ";"))); assertThat(lexer.nextToken(), is(new Token(TokenType.LET, "let"))); assertThat(lexer.nextToken(), is(new Token(TokenType.IDENT, "add"))); assertThat(lexer.nextToken(), is(new Token(TokenType.ASSIGN, "="))); assertThat(lexer.nextToken(), is(new Token(TokenType.FUNC, "fn"))); assertThat(lexer.nextToken(), is(new Token(TokenType.LPAREN, "("))); assertThat(lexer.nextToken(), is(new Token(TokenType.IDENT, "x"))); assertThat(lexer.nextToken(), is(new Token(TokenType.COMMA, ","))); assertThat(lexer.nextToken(), is(new Token(TokenType.IDENT, "y"))); assertThat(lexer.nextToken(), is(new Token(TokenType.RPAREN, ")"))); assertThat(lexer.nextToken(), is(new Token(TokenType.LBRACE, "{"))); assertThat(lexer.nextToken(), is(new Token(TokenType.IDENT, "x"))); assertThat(lexer.nextToken(), is(new Token(TokenType.PLUS, "+"))); assertThat(lexer.nextToken(), is(new Token(TokenType.IDENT, "y"))); assertThat(lexer.nextToken(), is(new Token(TokenType.SEMIC, ";"))); assertThat(lexer.nextToken(), is(new Token(TokenType.RBRACE, "}"))); assertThat(lexer.nextToken(), is(new Token(TokenType.SEMIC, ";"))); assertThat(lexer.nextToken(), is(new Token(TokenType.LET, "let"))); assertThat(lexer.nextToken(), is(new Token(TokenType.IDENT, "result"))); assertThat(lexer.nextToken(), is(new Token(TokenType.ASSIGN, "="))); assertThat(lexer.nextToken(), is(new Token(TokenType.IDENT, "add"))); assertThat(lexer.nextToken(), is(new Token(TokenType.LPAREN, "("))); assertThat(lexer.nextToken(), is(new Token(TokenType.IDENT, "five"))); assertThat(lexer.nextToken(), is(new Token(TokenType.COMMA, ","))); assertThat(lexer.nextToken(), is(new Token(TokenType.IDENT, "ten"))); assertThat(lexer.nextToken(), is(new Token(TokenType.RPAREN, ")"))); }
少し長いですが、入力を Token
に置き換えるだけの単純なものです。
では、このテストを通すように実装を追加していきましょう。
TokenType
にMonkey 言語のキーワードを判別する lookup を追加しておきます。
public enum TokenType { // ... private static Map<String, TokenType> keywords = new HashMap<>(); static { keywords.put("fn", FUNC); keywords.put("let", LET); } public static TokenType lookupIdent(String ident) { return keywords.getOrDefault(ident, IDENT); } }
Lexer
は以下のようになります。
public class Lexer { /** input string. */ private String input; /** current position in input (points to current char). */ private int position; /** current reading position in input (after current char). */ private int readPosition; /** current char under examination. */ private char ch; public Lexer(String input) { this.input = input; this.readPosition = 0; readChar(); } public Token nextToken() { skipWhitespace(); Token token; switch (ch) { case '=' : token = new Token(TokenType.ASSIGN, "="); break; case '+' : token = new Token(TokenType.PLUS, "+"); break; case '(' : token = new Token(TokenType.LPAREN, "("); break; case ')' : token = new Token(TokenType.RPAREN, ")"); break; case '{' : token = new Token(TokenType.LBRACE, "{"); break; case '}' : token = new Token(TokenType.RBRACE, "}"); break; case ',' : token = new Token(TokenType.COMMA, ","); break; case ';' : token = new Token(TokenType.SEMIC, ";"); break; case (byte) 0 : token = new Token(TokenType.EOF, ""); break; default : if (isLetter(ch)) { final String ident = readIdentifier(); return new Token(TokenType.lookupIdent(ident), ident); } else if (isDigit(ch)) { final String num = readNumber(); return new Token(TokenType.INT, num); } else { token = new Token(TokenType.ILLEGAL, ch + ""); } } readChar(); return token; } private void readChar() { if (readPosition >= input.length()) { ch = (byte) 0; } else { ch = input.charAt(readPosition); } position = readPosition; readPosition += 1; } private String readIdentifier() { int pos = position; while (isLetter(ch)) { readChar(); } return input.substring(pos, position); } private String readNumber() { int pos = position; while (isDigit(ch)) { readChar(); } return input.substring(pos, position); } private static boolean isLetter(char c) { return 'a' <= c && c <= 'z' || 'A' <= c && c <= 'Z' || c == '_'; } private static boolean isDigit(char c) { return '0' <= c && c <= '9'; } private void skipWhitespace() { while(ch == ' ' || ch == '\t' || ch == '\n' || ch == '\r') { readChar(); } } }
少し長いですが、入力を1文字ずつ読み、条件に応じて Token
インスタンスを返しているだけですね。
これでテストがグリーンとなります。
続いて、解析できる字句を増やしていきましょう。
1文字トークンの追加
TokenType に「-」「!」「/」「*」「<」「>」を追加します。
public enum TokenType { MINUS("-"), BANG("!"), ASTER("*"), SLASH("/"), LT("<"), GT(">"), }
Lexer
にも追加するだけで対応は完了です。
public Token nextToken() { // ... switch (ch) { case '-' : token = new Token(TokenType.MINUS, "-"); break; case '!' : token = new Token(TokenType.BANG, "!"); break; case '*' : token = new Token(TokenType.ASTER, "*"); break; case '/' : token = new Token(TokenType.SLASH, "/"); break; case '<' : token = new Token(TokenType.LT, "<"); break; case '>' : token = new Token(TokenType.GT, ">"); break; // ... } }
新しいキーワードの追加
TokenType
に新しいキーワード、true
、false
、if
、else
、return
を追加しましょう。
こちらは、キーワードリストへの追加が必要です。
public enum TokenType { //... TRUE("TRUE"), FALSE("FALSE"), IF("IF"), ELSE("ELSE"), RETURN("RETURN"), ; //... private static Map<String, TokenType> keywords = new HashMap<>(); static { keywords.put("fn", FUNC); keywords.put("let", LET); keywords.put("true", TRUE); keywords.put("false", FALSE); keywords.put("if", IF); keywords.put("else", ELSE); keywords.put("return", RETURN); } }
単純にこれだけで対応は完了です。
2文字トークンの追加
続いて2文字のトークン「==」と「!=」です。
同じように TokenType
を追加します。
public enum TokenType { EQ("=="), NOT_EQ("!="), // ... }
Lexer
に次の文字をのぞき見するヘルパーメソッドを追加し、
private char peekChar() { if (readPosition >= input.length()) { return (byte) 0; } else { return input.charAt(readPosition); } }
単純に switch
の中に条件を加えるだけとします。
case '=' : if (peekChar() == '=') { readChar(); token = new Token(TokenType.EQ, "=="); } else { token = new Token(TokenType.ASSIGN, "="); } break; case '!' : if (peekChar() == '=') { readChar(); token = new Token(TokenType.NOT_EQ, "!="); } else { token = new Token(TokenType.BANG, "!"); } break;
これで2文字のトークンが扱えるようになりました。
まとめ
これで、Monkey の字句解析器は完成です。
TokenType
は以下のようになりました。
package monkey.token; import java.util.HashMap; import java.util.Map; public enum TokenType { ILLEGAL("ILLEGAL"), EOF("EOF"), IDENT("IDENT"), // add, foobar, x, y, ... INT("INT"), ASSIGN("="), PLUS("+"), MINUS("-"), BANG("!"), ASTER("*"), SLASH("/"), LT("<"), GT(">"), COMMA(","), SEMIC(";"), LPAREN("("), RPAREN(")"), LBRACE("{"), RBRACE("}"), EQ("=="), NOT_EQ("!="), FUNC("FUNC"), LET("LET"), TRUE("TRUE"), FALSE("FALSE"), IF("IF"), ELSE("ELSE"), RETURN("RETURN"), ; private final String str; TokenType(String str) { this.str = str; } public String str() { return str; } private static Map<String, TokenType> keywords = new HashMap<>(); static { keywords.put("fn", FUNC); keywords.put("let", LET); keywords.put("true", TRUE); keywords.put("false", FALSE); keywords.put("if", IF); keywords.put("else", ELSE); keywords.put("return", RETURN); } public static TokenType lookupIdent(String ident) { return keywords.getOrDefault(ident, IDENT); } }
Token
は以下のようになりました。
package monkey.token; import java.util.Objects; public class Token { public final TokenType type; public final String literal; public Token(TokenType type, String literal) { this.type = type; this.literal = literal; } @Override public boolean equals(Object o) { if (this == o) return true; if (o == null || getClass() != o.getClass()) return false; Token token = (Token) o; return type == token.type && Objects.equals(literal, token.literal); } @Override public int hashCode() { return Objects.hash(type, literal); } @Override public String toString() { return "Token{" + "type=" + type + ", literal='" + literal + '\'' + '}'; } }
Lexer
は以下のようになりました。
package monkey.lexer; import monkey.token.Token; import monkey.token.TokenType; public class Lexer { /** input string. */ private String input; /** current position in input (points to current char). */ private int position; /** current reading position in input (after current char). */ private int readPosition; /** current char under examination. */ private char ch; public Lexer(String input) { this.input = input; this.readPosition = 0; readChar(); } public Token nextToken() { skipWhitespace(); Token token; switch (ch) { case '=' : if (peekChar() == '=') { readChar(); token = new Token(TokenType.EQ, "=="); } else { token = new Token(TokenType.ASSIGN, "="); } break; case '!' : if (peekChar() == '=') { readChar(); token = new Token(TokenType.NOT_EQ, "!="); } else { token = new Token(TokenType.BANG, "!"); } break; case '+' : token = new Token(TokenType.PLUS, "+"); break; case '-' : token = new Token(TokenType.MINUS, "-"); break; case '*' : token = new Token(TokenType.ASTER, "*"); break; case '/' : token = new Token(TokenType.SLASH, "/"); break; case '<' : token = new Token(TokenType.LT, "<"); break; case '>' : token = new Token(TokenType.GT, ">"); break; case '(' : token = new Token(TokenType.LPAREN, "("); break; case ')' : token = new Token(TokenType.RPAREN, ")"); break; case '{' : token = new Token(TokenType.LBRACE, "{"); break; case '}' : token = new Token(TokenType.RBRACE, "}"); break; case ',' : token = new Token(TokenType.COMMA, ","); break; case ';' : token = new Token(TokenType.SEMIC, ";"); break; case (byte) 0 : token = new Token(TokenType.EOF,""); break; default : if (isLetter(ch)) { final String ident = readIdentifier(); return new Token(TokenType.lookupIdent(ident), ident); } else if (isDigit(ch)) { final String num = readNumber(); return new Token(TokenType.INT, num); } else { token = new Token(TokenType.ILLEGAL, ch + ""); } } readChar(); return token; } private void readChar() { if (readPosition >= input.length()) { ch = (byte) 0; } else { ch = input.charAt(readPosition); } position = readPosition; readPosition += 1; } private char peekChar() { if (readPosition >= input.length()) { return (byte) 0; } else { return input.charAt(readPosition); } } private String readIdentifier() { int pos = position; while (isLetter(ch)) { readChar(); } return input.substring(pos, position); } private String readNumber() { int pos = position; while (isDigit(ch)) { readChar(); } return input.substring(pos, position); } private static boolean isLetter(char c) { return 'a' <= c && c <= 'z' || 'A' <= c && c <= 'Z' || c == '_'; } private static boolean isDigit(char c) { return '0' <= c && c <= '9'; } private void skipWhitespace() { while(ch == ' ' || ch == '\t' || ch == '\n' || ch == '\r') { readChar(); } } }
最後に Lexer
のテストケースです。
package monkey.lexer; import monkey.token.Token; import monkey.token.TokenType; import org.junit.Test; import static org.hamcrest.core.Is.is; import static org.junit.Assert.*; public class LexerTest { @Test public void nextToken() { String input = "let five = 5;" + "\n" + "let ten = 10;" + "\n" + "let add = fn(x, y) {" + "\n" + " x + y;" + "\n" + "};" + "\n" + "let result = add(five, ten);" + "\n" + "!-/*5;" + "\n" + "5 < 10 > 5;" + "\n" + "if (5 < 10) {" + "\n" + " return true;" + "\n" + "} else {" + "\n" + " return false;" + "\n" + "}" + "\n" + "10 == 10;" + "\n" + "10 != 9;"; Lexer lexer = new Lexer(input); assertThat(lexer.nextToken(), is(new Token(TokenType.LET, "let"))); assertThat(lexer.nextToken(), is(new Token(TokenType.IDENT, "five"))); assertThat(lexer.nextToken(), is(new Token(TokenType.ASSIGN, "="))); assertThat(lexer.nextToken(), is(new Token(TokenType.INT, "5"))); assertThat(lexer.nextToken(), is(new Token(TokenType.SEMIC, ";"))); assertThat(lexer.nextToken(), is(new Token(TokenType.LET, "let"))); assertThat(lexer.nextToken(), is(new Token(TokenType.IDENT, "ten"))); assertThat(lexer.nextToken(), is(new Token(TokenType.ASSIGN, "="))); assertThat(lexer.nextToken(), is(new Token(TokenType.INT, "10"))); assertThat(lexer.nextToken(), is(new Token(TokenType.SEMIC, ";"))); assertThat(lexer.nextToken(), is(new Token(TokenType.LET, "let"))); assertThat(lexer.nextToken(), is(new Token(TokenType.IDENT, "add"))); assertThat(lexer.nextToken(), is(new Token(TokenType.ASSIGN, "="))); assertThat(lexer.nextToken(), is(new Token(TokenType.FUNC, "fn"))); assertThat(lexer.nextToken(), is(new Token(TokenType.LPAREN, "("))); assertThat(lexer.nextToken(), is(new Token(TokenType.IDENT, "x"))); assertThat(lexer.nextToken(), is(new Token(TokenType.COMMA, ","))); assertThat(lexer.nextToken(), is(new Token(TokenType.IDENT, "y"))); assertThat(lexer.nextToken(), is(new Token(TokenType.RPAREN, ")"))); assertThat(lexer.nextToken(), is(new Token(TokenType.LBRACE, "{"))); assertThat(lexer.nextToken(), is(new Token(TokenType.IDENT, "x"))); assertThat(lexer.nextToken(), is(new Token(TokenType.PLUS, "+"))); assertThat(lexer.nextToken(), is(new Token(TokenType.IDENT, "y"))); assertThat(lexer.nextToken(), is(new Token(TokenType.SEMIC, ";"))); assertThat(lexer.nextToken(), is(new Token(TokenType.RBRACE, "}"))); assertThat(lexer.nextToken(), is(new Token(TokenType.SEMIC, ";"))); assertThat(lexer.nextToken(), is(new Token(TokenType.LET, "let"))); assertThat(lexer.nextToken(), is(new Token(TokenType.IDENT, "result"))); assertThat(lexer.nextToken(), is(new Token(TokenType.ASSIGN, "="))); assertThat(lexer.nextToken(), is(new Token(TokenType.IDENT, "add"))); assertThat(lexer.nextToken(), is(new Token(TokenType.LPAREN, "("))); assertThat(lexer.nextToken(), is(new Token(TokenType.IDENT, "five"))); assertThat(lexer.nextToken(), is(new Token(TokenType.COMMA, ","))); assertThat(lexer.nextToken(), is(new Token(TokenType.IDENT, "ten"))); assertThat(lexer.nextToken(), is(new Token(TokenType.RPAREN, ")"))); assertThat(lexer.nextToken(), is(new Token(TokenType.SEMIC, ";"))); assertThat(lexer.nextToken(), is(new Token(TokenType.BANG, "!"))); assertThat(lexer.nextToken(), is(new Token(TokenType.MINUS, "-"))); assertThat(lexer.nextToken(), is(new Token(TokenType.SLASH, "/"))); assertThat(lexer.nextToken(), is(new Token(TokenType.ASTER, "*"))); assertThat(lexer.nextToken(), is(new Token(TokenType.INT, "5"))); assertThat(lexer.nextToken(), is(new Token(TokenType.SEMIC, ";"))); assertThat(lexer.nextToken(), is(new Token(TokenType.INT, "5"))); assertThat(lexer.nextToken(), is(new Token(TokenType.LT, "<"))); assertThat(lexer.nextToken(), is(new Token(TokenType.INT, "10"))); assertThat(lexer.nextToken(), is(new Token(TokenType.GT, ">"))); assertThat(lexer.nextToken(), is(new Token(TokenType.INT, "5"))); assertThat(lexer.nextToken(), is(new Token(TokenType.SEMIC, ";"))); assertThat(lexer.nextToken(), is(new Token(TokenType.IF, "if"))); assertThat(lexer.nextToken(), is(new Token(TokenType.LPAREN, "("))); assertThat(lexer.nextToken(), is(new Token(TokenType.INT, "5"))); assertThat(lexer.nextToken(), is(new Token(TokenType.LT, "<"))); assertThat(lexer.nextToken(), is(new Token(TokenType.INT, "10"))); assertThat(lexer.nextToken(), is(new Token(TokenType.RPAREN, ")"))); assertThat(lexer.nextToken(), is(new Token(TokenType.LBRACE, "{"))); assertThat(lexer.nextToken(), is(new Token(TokenType.RETURN, "return"))); assertThat(lexer.nextToken(), is(new Token(TokenType.TRUE, "true"))); assertThat(lexer.nextToken(), is(new Token(TokenType.SEMIC, ";"))); assertThat(lexer.nextToken(), is(new Token(TokenType.RBRACE, "}"))); assertThat(lexer.nextToken(), is(new Token(TokenType.ELSE, "else"))); assertThat(lexer.nextToken(), is(new Token(TokenType.LBRACE, "{"))); assertThat(lexer.nextToken(), is(new Token(TokenType.RETURN, "return"))); assertThat(lexer.nextToken(), is(new Token(TokenType.FALSE, "false"))); assertThat(lexer.nextToken(), is(new Token(TokenType.SEMIC, ";"))); assertThat(lexer.nextToken(), is(new Token(TokenType.RBRACE, "}"))); assertThat(lexer.nextToken(), is(new Token(TokenType.INT, "10"))); assertThat(lexer.nextToken(), is(new Token(TokenType.EQ, "=="))); assertThat(lexer.nextToken(), is(new Token(TokenType.INT, "10"))); assertThat(lexer.nextToken(), is(new Token(TokenType.SEMIC, ";"))); assertThat(lexer.nextToken(), is(new Token(TokenType.INT, "10"))); assertThat(lexer.nextToken(), is(new Token(TokenType.NOT_EQ, "!="))); assertThat(lexer.nextToken(), is(new Token(TokenType.INT, "9"))); assertThat(lexer.nextToken(), is(new Token(TokenType.SEMIC, ";"))); } }
次回は構文解析器です。