🎄ANTLR4 and other parser generators🌟

Dec 11, 2023

Why must a senior developer in 2024 know about parser generators?

There are two primary use cases for parser generators:

  1. Create your own language or DSL. For example, you can create a language in which your business users write business rules.
  2. Parse an existing language. For example, you can parse SQL and convert it to a different language such as the Elasticsearch Query DSL.
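
To make the second use case concrete, here is a deliberately tiny sketch of such a translation. It assumes a toy input of `field = 'value'` conditions joined by AND and maps them to Elasticsearch `term` clauses inside a `bool`/`filter` query. The class and method names are hypothetical, and it does no JSON escaping. It splits strings by hand, so it breaks as soon as the input contains OR, parentheses, or the word AND inside a quoted value. That gap is what a grammar and a generated parser are meant to close.

```java
import java.util.ArrayList;
import java.util.List;

// Toy translator: handles only `field = 'value'` conditions joined by AND.
// The supported syntax and all names here are assumptions for illustration.
class WhereToEsSketch {

    // Translates e.g. "status = 'open'" into {"term":{"status":"open"}}.
    static String termQuery(String condition) {
        String[] parts = condition.split("=", 2);
        if (parts.length != 2) {
            throw new IllegalArgumentException("Expected field = 'value': " + condition);
        }
        String field = parts[0].trim();
        String value = parts[1].trim();
        // Strip the surrounding single quotes of a SQL string literal, if present.
        if (value.length() >= 2 && value.startsWith("'") && value.endsWith("'")) {
            value = value.substring(1, value.length() - 1);
        }
        return "{\"term\":{\"" + field + "\":\"" + value + "\"}}";
    }

    // Combines AND-ed conditions into an Elasticsearch bool/filter query.
    static String translate(String where) {
        List<String> terms = new ArrayList<>();
        for (String condition : where.split("(?i)\\s+AND\\s+")) {
            terms.add(termQuery(condition));
        }
        return "{\"query\":{\"bool\":{\"filter\":[" + String.join(",", terms) + "]}}}";
    }

    public static void main(String[] args) {
        System.out.println(translate("status = 'open' AND owner = 'alice'"));
    }
}
```

Running `main` prints a single `bool` query with two `term` filters, one for `status` and one for `owner`.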

How to use parser generators?

  1. First, create a grammar file that describes the language you want to parse. For example:
grammar CSV;

csvFile
    : hdr row+ EOF
    ;

hdr
    : row
    ;
...
  2. Then generate a lexer and parser from the grammar file. For example, you can use ANTLR4 to generate a lexer and parser in Java:
java -jar antlr-4.9.2-complete.jar CSV.g4
  3. Then use the generated lexer and parser to parse input in that language. For example, to parse a CSV file:
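import java.io.IOException;
import org.antlr.v4.runtime.CharStreams;
import org.antlr.v4.runtime.CommonTokenStream;
import org.antlr.v4.runtime.tree.ParseTree;

public static void main(String[] args) throws IOException {
    // The lexer turns the characters of test.csv into a stream of tokens.
    CSVLexer lexer = new CSVLexer(CharStreams.fromFileName("test.csv"));
    CommonTokenStream tokens = new CommonTokenStream(lexer);
    // The parser builds a parse tree, starting from the csvFile rule.
    CSVParser parser = new CSVParser(tokens);
    ParseTree tree = parser.csvFile();
    System.out.println(tree.toStringTree(parser));
}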
  4. Finally, traverse the parse tree. ANTLR4 generates a listener by default, and also a visitor base class when you run it with the -visitor option. For example, you can use a visitor to convert a CSV file to JSON:
public class CSVToJsonVisitor ...
    @Override
    public String visitCsvFile(CSVParser.CsvFileContext ctx) {
        return visitChildren(ctx);
    }

    @Override
    public String visitRow(CSVParser.RowContext ctx) {
        ...
    }

    @Override
    public String visitField(CSVParser.FieldContext ctx) {
        ...
    }
}
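
To show the idea behind the visitor without depending on generated code, here is a self-contained sketch in plain Java. The node classes stand in for the context classes a parser generator would emit, and the tree is built with a naive split on commas. Every name in it is hypothetical and none of it is ANTLR's generated API. It also assumes a CSV with a header row, no quoted fields, and no embedded commas.

```java
import java.util.ArrayList;
import java.util.List;

// Visitor that renders the tree as a JSON array of objects keyed by the header.
class CsvToJsonSketch implements Visitor<String> {
    private List<String> keys = new ArrayList<>();

    public String visitFile(FileNode file) {
        // Remember the header cells as JSON keys before visiting the data rows.
        keys = new ArrayList<>();
        for (FieldNode f : file.header.fields) keys.add(f.accept(this));
        List<String> objects = new ArrayList<>();
        for (RowNode row : file.rows) objects.add(row.accept(this));
        return "[" + String.join(",", objects) + "]";
    }

    public String visitRow(RowNode row) {
        List<String> pairs = new ArrayList<>();
        // Cells beyond the header width are ignored in this sketch.
        for (int i = 0; i < row.fields.size() && i < keys.size(); i++) {
            pairs.add(keys.get(i) + ":" + row.fields.get(i).accept(this));
        }
        return "{" + String.join(",", pairs) + "}";
    }

    public String visitField(FieldNode field) {
        // Emit every field as a JSON string, escaping backslashes and quotes.
        return "\"" + field.text.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    public static void main(String[] args) {
        FileNode tree = new FileNode("name,age\nAda,36\nLinus,54");
        System.out.println(tree.accept(new CsvToJsonSketch()));
    }
}

// One visit method per node type; T is the result type of the traversal.
interface Visitor<T> {
    T visitFile(FileNode file);
    T visitRow(RowNode row);
    T visitField(FieldNode field);
}

// Hand-written stand-ins for generated parse tree nodes (hypothetical names).
// Double dispatch: each node calls back the visit method for its own type.
class FieldNode {
    final String text;
    FieldNode(String text) { this.text = text; }
    <T> T accept(Visitor<T> v) { return v.visitField(this); }
}

class RowNode {
    final List<FieldNode> fields = new ArrayList<>();
    RowNode(String line) {
        // Naive split: assumes no quoted fields or embedded commas.
        for (String cell : line.split(",", -1)) fields.add(new FieldNode(cell.trim()));
    }
    <T> T accept(Visitor<T> v) { return v.visitRow(this); }
}

class FileNode {
    final RowNode header;
    final List<RowNode> rows = new ArrayList<>();
    FileNode(String csv) {
        String[] lines = csv.trim().split("\\R");
        header = new RowNode(lines[0]);
        for (int i = 1; i < lines.length; i++) rows.add(new RowNode(lines[i]));
    }
    <T> T accept(Visitor<T> v) { return v.visitFile(this); }
}
```

Running `main` prints `[{"name":"Ada","age":"36"},{"name":"Linus","age":"54"}]`. The tree classes know nothing about JSON, so a second output format would only need another `Visitor` implementation.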

Why can't I just use regular expressions?

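Regular expressions are great for simple use cases, but they are not powerful enough to parse complex languages. Classic regular expressions describe only regular languages, so they cannot match arbitrarily nested structures: a pattern has no way to count how many brackets or tags are still open. That is why you can't use them alone to parse nested formats like JSON or XML, which need a grammar with recursive rules.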
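
A small sketch of the difference, using nested square brackets as a stand-in for JSON arrays. The regex below is written for at most two levels of nesting and would have to be rewritten by hand for every extra level. The recursive method mirrors a one-line grammar rule and handles any depth. The class name and the toy bracket language are assumptions for illustration.

```java
import java.util.regex.Pattern;

class NestingSketch {
    // Accepts "[]" and one further level such as "[[][]]", but nothing deeper.
    static final Pattern TWO_LEVELS = Pattern.compile("\\[(\\[\\])*\\]");

    private final String input;
    private int pos = 0;

    private NestingSketch(String input) {
        this.input = input;
    }

    // Recursive-descent version of the grammar rule:  list : '[' list* ']' ;
    private boolean list() {
        if (pos >= input.length() || input.charAt(pos) != '[') return false;
        pos++; // consume '['
        while (pos < input.length() && input.charAt(pos) == '[') {
            if (!list()) return false; // the rule refers to itself
        }
        if (pos >= input.length() || input.charAt(pos) != ']') return false;
        pos++; // consume ']'
        return true;
    }

    // True if the whole input is exactly one well-nested list.
    static boolean isNestedList(String s) {
        NestingSketch parser = new NestingSketch(s);
        return parser.list() && parser.pos == s.length();
    }

    public static void main(String[] args) {
        String deep = "[[[]]]";
        System.out.println(TWO_LEVELS.matcher(deep).matches()); // false
        System.out.println(isNestedList(deep));                 // true
    }
}
```

The recursive call inside `list()` is the part a classic regular expression cannot express, and it is what a parser generator produces for you from a recursive grammar rule.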