Packrat Parser Combinator Library

Git repository: Link
Paper this was based on: Link

2/2/2021

Overview

Parser combinators are what you end up with when you start factoring out common pieces of functionality from recursive descent parsing. They are higher order functions that can be combined in modular ways to create a desired parser.

However they also inherit the drawbacks of recursive descent parsing, and in particular recursive descent parsing with backtracking. If the grammar that the parser is designed to accept contains left recursion then the parser will loop infinitely. If the grammar is ambiguous then only one result will be obtained. And any result may require exponential time and space to calculate.

This library, based on the paper linked at the top, solves all those problems and a few bits more. As an example, the following grammar portion:

Expression ::= Expression - Term | Term
Term ::= Term * Factor | Factor
Factor ::= ( Expression ) | Integer

Can be turned into the following code snippet:


package Expr_Redir is new Redirect;
package Term_Redir is new Redirect;

function Left_Parens is new Match ('(');
function Right_Parens is new Match (')');
function Fac_Expr is new Between (Left_Parens, Expr_Redir.Call, Right_Parens);
function Fac_Choice is new Choice_2 (Fac_Expr, Integer_Num);
function Factor is new Stamp (Factor_Label, Fac_Choice);

function Multiply is new Match ('*');
function Term_Mult is new Sequence (Term_Redir.Call'Access, Multiply'Access, Factor'Access);
function Term_Choice is new Choice_2 (Term_Mult, Factor);
function Term is new Stamp (Term_Label, Term_Choice);

function Minus is new Match ('-');
function Expr_Minus is new Sequence (Expr_Redir.Call'Access, Minus'Access, Term'Access);
function Expr_Choice is new Choice_2 (Expr_Minus, Term);
function Expression is new Stamp (Expression_Label, Expr_Choice);

Most of the verbosity is caused by the need to individually instantiate each combinator, as generics are used to serve the same purpose as higher order functions. Some bits are also omitted, such as the label enumeration and the actual setting of the redirectors. But the above should provide a good general impression.

Features

A list of features that this library provides includes:

Higher order combinator functions in Ada, a language that does not support functional programming
Both parser combinators and simpler lexer combinators are available
Input can be any array, whether that is text strings or otherwise
Left recursive grammars are parsed correctly with no infinite loops
Ambiguity is handled by incorporating all possible valid options into the resulting parse tree
Parsing and lexing can both be done piecewise, providing input in successive parts instead of all at once
Error messages are generated when applicable that note what would have been needed and where for a successful parse
All of the above is accomplished in polynomial worst case time and space complexity

More thorough documentation is provided in the /doc directory.

The name of the library comes from packrat parsing which is a parsing algorithm that avoids exponential time complexity by memoizing all intermediate results. As that is what this library does, both so as to reduce the time complexity and to enable piecewise parsing, the name seemed appropriate.

Left Recursion

Direct left recursion, meaning a grammar non-terminal that immediately recurses to itself on the left as in the Expression or Term used above, is fairly easy to handle. A counter is used to keep track of how many times a particular combinator has been applied to the input at a particular position, and when that counter exceeds the number of unconsumed input tokens plus one the result is curtailed. This is explained on pages 7 and 8 of the paper.

The really thorny issue that caused the most problems with this library is indirect left recursion. This is when a non-terminal recurses to itself on the left only after first evaluating to one or more other non-terminals. Curtailment in these circumstances can easily cause those other non-terminals to also be curtailed, and reusing the results for those other non-terminals may be incorrect. This issue along with a proposed solution is explained on page 9 of the paper. However that solution was not as clear as would be preferred. So some minor rephrasing and reworking was needed.

Bringing this problem back to the start: What are we really doing when we curtail a result due to left recursion? It is not a matter of cutting off branches in a parse tree. We are identifying conditions where the parse result of a particular non-terminal can be calculated without further recursion. The word "curtailment" is somewhat misleading in this regard. Once this reframing is done then a clearer view immediately follows.

What is the condition? Exactly as described above for direct left recursion. Through comparing recursion counts with the amount of unconsumed input we determine that a result of no successful parse can be calculated, and that the result is valid for reuse for any deeper recursion of the same combinator at that input position.

From that can be derived:

When merging two results that have different left recursion count conditions for the same non-terminal, the larger count should be used
Conditions of subresults should also be made part of any result that includes those subresults
Any memoized result is only reusable if all the recursion count conditions of the stored result are less than or equal to the recursion counts for the current input position

So far the above list just covers what is in the paper. But there is a little more that can be inferred:

If a result is not reusable and a new result is calculated, then the recursion count conditions of the old result should be updated to the recursion counts at the current position and applied to the new result
When the recursion count of a condition applied to a result plus the number of unconsumed input tokens after the result is less than the number of input tokens available at the beginning of the result, then that condition can be omitted from the result

These two details should constitute a minor efficiency improvement.

Further Work

While the polynomial complexity of this library has been experimentally confirmed, no effort has yet been made to prove that it is actually polynomial in the same way that the parser combinators in the paper are. It is possible that due to the changes involved with using a non-functional language and enabling piecewise parsing that some subtle complexity difference may have arisen.

Likewise, the piecewise parsing has been unit tested to a degree but no formal proof that it is correct has been done.

Ideas like being able to modify and resume an erroneous parsing attempt would also be interesting to explore.

Finally, the plan is to actually use this library for something significant at some point in the future.

Jedidiah Barber's Personal Site