Benjamin Johnston - Prototyping grammars in JavaScript / TypeScript

Context-free grammars are a popular technique for parsing input in software development. Implementing a grammar can be difficult or require complex dependencies. Ambiguous and recursive grammars can be particularly challenging.

Read on for a technique I like to use when rapidly prototype grammars in JavaScript or TypeScript. The technique uses yield to simulate Prolog's definite clause grammars in JavaScript.

Generators in JavaScript

This technique depends on a seldom used feature of JavaScript: generator functions.

A generator function can “return” multiple answers using the yield keyword. It is declared using function* (i.e., there's an asterisk in the declaration). For example:

function* names() {
    yield "Alice";
    yield "Bob";
}

yield is similar to return, but the results are accessed by iteration. The following code will print the names Alice and Bob on separate lines.

for (const name of names()) {
    console.log(name);
}

Generator functions do not return an array. Instead, the names() generator function is “paused” each time yield is encountered, so that one value is returned at a time. This means that there is no harm in yielding many or infinitely many values.

For example, even though the following code contains an infinite loop, each value is computed on demand. The function is paused on each invocation.

function* forever() {
    let i = 0;
    while (true)
        yield i++;
}

The following code will print only the number 0 - it does not crash!

forever().next().value // Prints only the first value yielded

Because generator functions are not widely familiar, it is good practice to avoid their use in everyday code.

The ability to “pause” between each yield is sometimes used to implement custom async/await-style logic. However, they can also be used to allow a function to return multiple answers.

A Sample Grammar

For this article, I'll implement a simple grammar:

<expr> ::= <appraisal> | <antiappraisal>

<antiappraisal> ::= <not> <appraisal>
<not> ::= "not "

<appraisal> ::= <positive> | <negative>
<positive> ::= "good" | "excellent"
<negative> ::= "bad" | "terrible"

The language of this grammar is simple appraisals: e.g., "good", "not good", "bad", "not terrible".

Instead of a parse tree, I will have the parser output true for overall positive appraisals (e.g., "good", "not bad") and false for overall negative appraisals (e.g., "terrible", "not excellent").

Parsing

A parser is a function that takes a string as input and then outputs a parse tree.

To ease implementation, I use a more general definition.

In particular, I allow a parser to begin anywhere in an input string, and perform a partial parse from that starting point. Furthermore, if the grammar is ambiguous, it can return multiple parses. That is, a parser is a generator that accepts a string and a start position, and yields parse trees with their end positions.

In TypeScript:

type ParserGenerator<T> = Generator<[T, number]>;
type Parser<T> = (input: string, start: number) => Generator<[T, number]>;

Here is an example of a parser for <positive>:

const positive: Parser<boolean> = function *(input, start) {
    if (input.substring(start).startsWith("good"))
        yield [true, start + "good".length];
    if (input.substring(start).startsWith("excellent"))
        yield [true, start + "excellent".length];
};

The function will produce parses for positive strings:

for (const parse of positive("good", 0))
    console.log(parse); // prints [true, 4]

for (const parse of positive("excellent", 0))
    console.log(parse); // prints [true, 9]

for (const parse of positive("blah blah", 0))
    console.log(parse); // no results -- prints nothing

for (const parse of positive("good dog", 0))
    console.log(parse); // prints [true, 4] 
                        // (i.e., partial parse)

The parser yields possible parses (i.e., the “solutions”) consisting of the parse tree (true or false) and the end position. It yields nothing if there is no match.

The same approach can be used to implement <negative>:

const negative: Parser<boolean> = function *(input, start) {
    if (input.substring(start).startsWith("bad"))
        yield [false, start + "bad".length];
    if (input.substring(start).startsWith("terrible"))
        yield [false, start + "terrible".length];
};

... as well as <not>:

const not: Parser<undefined> = function *(input, start) {
    if (input.substring(start).startsWith("not "))
        yield [undefined, start + "not ".length];
};

The output of two generators can be combined by iterating over each and yielding their results. This allows us to combine parsers as alternatives, as required for <appraisal>:

const appraisal: Parser<boolean> = function *(input, start) {
    for (const parse of positive(input, start))
        yield parse;
    for (const parse of negative(input, start))
        yield parse;
};

The code above works by first yielding every parse found by positive, then every parse found by negative.

In <antiappraisal>, grammar rules are combined in sequence. This is implemented by using the end of one rule as the start of the next rule:

const antiappraisal: Parser<boolean> = function *(input, start) {
    for (const [ _, notEnd ] of not(input, start))
        for (const [ value, appraisalEnd] of appraisal(input, notEnd))
            // Yield the boolean negation of the parsed value
            yield [!value, appraisalEnd];
};

Finally, the <expr> rule combines two generators, similar to the implementation of the <appraisal> rule:

const expr: Parser<boolean> = function *(input, start) {
    for (const parse of appraisal(input, start))
        yield parse;
    for (const parse of antiappraisal(input, start))
        yield parse;
};

To use this parser, it is invoked with some input, starting from 0:

for (const [value, end] of expr("good", 0))
    console.log(value); // outputs true


for (const [value, end] of expr("not good", 0))
    console.log(value); // outputs false

Metaprogramming

The beauty of having a standard type for every parser is that parsers combine in interesting ways.

For example, a helper can find the first complete parse (i.e., exclude partial parses):

function parseCompletely<T>(parser: Parser<T>, input: string): T | undefined {
    for (const [value, end] of parser(input, 0))
        if (end === input.length)
            return value; // RETURN the first matching value (and stop)

    return undefined; // No matching value
}

// Usage:
console.log(parseCompletely(expr, "not good"));

A generic “or” rule can choose between alternatives (by iterating over each alternative, and yielding each result):

const or: <T,>(...alternatives: Parser<T>[]) => Parser<T> =
    (...alternatives) => function *(input, start) {
        for (const alternative of alternatives)
            for (const parse of alternative(input, start))
                yield parse;
    };

// Usage:
const appraisal = or(positive, negative);
const expr = or(appraisal, antiappraisal);

A generic “token” rule can recognize a fixed string:

const token: (pattern: string) => Parser<string> =
    pattern => function *(input, start) {
        if (input.substring(start).startsWith(pattern))
            yield [pattern, start + pattern.length];
    };

// Usage:
const positiveToken = or(token("good"), token("excellent"));
const negativeToken = or(token("bad"), token("terrible"));

A map function can transform the parse tree output by a parser:

const map: <A,B>(parser: Parser<A>, mapper: (value: A) => B) => Parser<B> =
    (parser, mapper) => function *(input, start) {
        for (const [value, end] of parser(input, start))
            yield [mapper(value), end];
    };

// Usage:
const positive = map(positiveToken, _ => true);
const negative = map(negativeToken, _ => false);

Finally, a generic sequencing function can combine two parsers:

const and2: <A,B>(a: Parser<A>, b:Parser<B>) => Parser<[A,B]> = 
    (a, b) => function *(input, start) {
        for (const [ aValue, aEnd] of a(input, start))
            for (const [ bValue, bEnd] of b(input, aEnd))
                yield [[aValue, bValue], bEnd];
    };

Now, I admit that the code above to combine parsers is not entirely intuitive, especially to those unfamiliar to the technique. However, the approach simplifies grammar implementation.

For example, the following is the complete appraisal grammar constructed with these parser combining rules:

const positive = map(or(token("good"), token("excellent")), _ => true);
const negative = map(or(token("bad"), token("terrible")), _ => false);
const appraisal = or(positive, negative);
const not = token("not ");
const antiappraisal = map(and2(not, appraisal), ([_not, appr]) => !appr);
const expr = or(appraisal, antiappraisal);

// Usage:
console.log(parseCompletely(expr, "excellent"));    // returns true
console.log(parseCompletely(expr, "not terrible")); // returns true
console.log(parseCompletely(expr, "terrible"));     // returns false
console.log(parseCompletely(expr, "not good"));     // returns false

Definite Clause Grammars

This approach has many similarities to top-down parser implementation strategies. I was specifically inspired by definite clause grammars.

In the Prolog programming language, definite clause grammars (DCG) are an elegant approach to implementing a grammar:

expr(A) --> appraisal(A).
expr(A) --> antiappraisal(A).

antiappraisal(true) --> not, appraisal(false).
antiappraisal(false) --> not, appraisal(true).
not --> "not ".

appraisal(true) --> positive.
appraisal(false) --> negative.

positive --> "good".
positive --> "excellent".

negative --> "bad".
negative --> "terrible".

Behind the scenes, these DCG rules are expanded into ordinary Prolog rules by adding two parameters:

expr(A, H, T) :- appraisal(A, H, T).
expr(A, H, T) :- antiappraisal(A, H, T).

antiappraisal(true, H, T):- not(H, T1), appraisal(false, T1, T).
antiappraisal(false, H, T):- not(H, T1), appraisal(true, T1, T).
not(H, T):- append(`not `, T, H).

appraisal(true, H, T) :- positive(H, T).
appraisal(false, H, T) :- negative(H, T).

positive(H, T) :- append(`good`, T, H).
positive(H, T) :- append(`excellent`, T, H).

negative(H, T) :- append(`bad`, T, H).
negative(H, T) :- append(`terrible`, T, H).

These two additional parameters track the start and end positions of the parser in the input (H and T standing for Head and Tail).

Prolog executes using a search algorithm that tests many possibilities.

I have translated these concepts into JavaScript by replacing search with yield, and by replacing the Head and Tail lists by array indexes.

Going Further

This approach can be extended to build more complex grammars. First, a few extra operations are required.

A rule that matches the empty string:

const empty: <T>(value: T) => Parser<T> = 
    value => function *(_input, start) {
        yield [value, start];
    };

A generic version of and2 that allows any number of parsers to be combined in sequence:

const and: <T>(...steps: Parser<T>[]) => Parser<T[]> = 
    (...steps) => {
        if (steps.length === 0)
            return empty([]);
        else {
            const [first, ...rest] = steps;
            const restSequence = and(...rest);
            return map(and2(first, restSequence), ([firstValue, restValues]) => [firstValue, ...restValues]);
        }
    };

A regular-expression-matching rule:

const match: (pattern: RegExp) => Parser<string> =
    pattern => function *(input, start) {
        const match = input.substring(start).match(pattern);
        if (match && match.index === 0)
            yield [match[0], start + match[0].length]
    };

A whitespace-matching rule:

const whitespace = match(/\s+/);

An integer-matching rule:

const integer = map(match(/-?\d+/), value => parseInt(value));

Recursive Grammars

Recursive grammars require rules to be defined out-of-order. In other words, the grammar may refer to a rule before it is defined.

A “placeholder” makes this possible:

const nothing: Parser<any> = function *(_input, _start) {};

const placeholder: <T>() => [Parser<T>, (parser: Parser<T>) => void] =
    <T>() => {
        let target: Parser<T> = nothing;
        return [
            (input, start) => target(input, start),
            (parser: Parser<T>) => { target = parser; }
        ]
    };

placeholder returns a “placeholder” rule that can be used in other rules, along with a setter to fix its implementation later.

For example, given a recursive grammar for simple expressions including addition (+) and multiplication (*):

<sum> ::= <product> | <product> "+" <sum>
<product> ::= <value> | <value> "*" <product>
<value> ::= <integer> | "(" <sum> ")"

The placeholder can be used to generate that grammar:

const [ sum, setSum ] = placeholder<number>();
const [ product, setProduct ] = placeholder<number>();
const [ value, setValue ] = placeholder<number>();

// <sum> ::= <product> | <product> "+" <sum>
setSum(
    or(
        product,
        map(and<any>(product,token("+"),sum), ([a, _, b]) => a + b)
    )
);

// <product> ::= <value> | <value> "*" <product>
setProduct(
    or(
        value,
        map(and<any>(value,token("*"),product), ([a, _, b]) => a * b)
    )
);

// <value> ::= <integer> | "(" <sum> ")"
setValue(
    or(
        integer,
        map(and<any>(token('('), sum, token(')')), ([_l, a, _r]) => a as number)
    )
);

// Usage:
console.log(parseCompletely(sum, "1+2*3+(4+5)*6")); // prints 61

An advantage of this technique is that it will work even with ambiguous grammars (i.e., if there are multiple valid parses of a given string).

However, there is one disadvantage: it cannot be used with left-recursive grammars.

As an extreme example, consider that the following left-recursive grammar would obviously result in an infinite loop"

const [ loop, setLoop ] = placeholder<number>();

// <loop> ::= <loop>
setLoop(loop); // results in a stack overflow

Conclusion

I have found this method to be a fast way to prototype grammars. While the higher order rules (and and placeholder) may not be intuitively obvious, they have the benefit of being only a few lines of code. The type checks provided by TypeScript make it surprisingly difficult to make errors when using this approach to grammar construction.

After prototyping, it may make sense to reimplement the parser manually: using a parser generator, a parser library or writing a top-down parser from scratch (i.e., to produce better explanations for users when the input does not match the grammar).