Parser combinators are what you end up with when you start factoring out common pieces of functionality from recursive descent parsing. They are higher order functions that can be combined in modular ways to create a desired parser.
However they also inherit the drawbacks of recursive descent parsing, and in particular recursive descent parsing with backtracking. If the grammar that the parser is designed to accept contains left recursion then the parser will loop infinitely. If the grammar is ambiguous then only one result will be obtained. And any result may require exponential time and space to calculate.
This library, based on the paper linked at the top, solves all those problems and a few bits more. As an example, the following grammar portion:
Expression ::= Expression - Term | Term Term ::= Term * Factor | Factor Factor ::= ( Expression ) | Integer
Can be turned into the following code snippet:
package Expr_Redir is new Redirect; package Term_Redir is new Redirect; function Left_Parens is new Match ('('); function Right_Parens is new Match (')'); function Fac_Expr is new Between (Left_Parens, Expr_Redir.Call, Right_Parens); function Fac_Choice is new Choice_2 (Fac_Expr, Integer_Num); function Factor is new Stamp (Factor_Label, Fac_Choice); function Multiply is new Match ('*'); function Term_Mult is new Sequence (Term_Redir.Call'Access, Multiply'Access, Factor'Access); function Term_Choice is new Choice_2 (Term_Mult, Factor); function Term is new Stamp (Term_Label, Term_Choice); function Minus is new Match ('-'); function Expr_Minus is new Sequence (Expr_Redir.Call'Access, Minus'Access, Term'Access); function Expr_Choice is new Choice_2 (Expr_Minus, Term); function Expression is new Stamp (Expression_Label, Expr_Choice);
Most of the verbosity is caused by the need to individually instantiate each combinator, as generics are used to serve the same purpose as higher order functions. Some bits are also omitted, such as the label enumeration and the actual setting of the redirectors. But the above should provide a good general impression.
A list of features that this library provides includes:
More thorough documentation is provided in the /doc directory.
The name of the library comes from packrat parsing which is a parsing algorithm that avoids exponential time complexity by memoizing all intermediate results. As that is what this library does, both so as to reduce the time complexity and to enable piecewise parsing, the name seemed appropriate.
Direct left recursion, meaning a grammar non-terminal that immediately recurses to itself on the left as in the Expression or Term used above, is fairly easy to handle. A counter is used to keep track of how many times a particular combinator has been applied to the input at a particular position, and when that counter exceeds the number of unconsumed input tokens plus one the result is curtailed. This is explained on pages 7 and 8 of the paper.
The really thorny issue that caused the most problems with this library is indirect left recursion. This is when a non-terminal recurses to itself on the left only after first evaluating to one or more other non-terminals. Curtailment in these circumstances can easily cause those other non-terminals to also be curtailed, and reusing the results for those other non-terminals may be incorrect. This issue along with a proposed solution is explained on page 9 of the paper. However that solution was not as clear as would be preferred. So some minor rephrasing and reworking was needed.
Bringing this problem back to the start: What are we really doing when we curtail a result due to left recursion? It is not a matter of cutting off branches in a parse tree. We are identifying conditions where the parse result of a particular non-terminal can be calculated without further recursion. The word "curtailment" is somewhat misleading in this regard. Once this reframing is done then a clearer view immediately follows.
What is the condition? Exactly as described above for direct left recursion. Through comparing recursion counts with the amount of unconsumed input we determine that a result of no successful parse can be calculated, and that the result is valid for reuse for any deeper recursion of the same combinator at that input position.
From that can be derived:
So far the above list just covers what is in the paper. But there is a little more that can be inferred:
These two details should constitute a minor efficiency improvement.
While the polynomial complexity of this library has been experimentally confirmed, no effort has yet been made to prove that it is actually polynomial in the same way that the parser combinators in the paper are. It is possible that due to the changes involved with using a non-functional language and enabling piecewise parsing that some subtle complexity difference may have arisen.
Likewise, the piecewise parsing has been unit tested to a degree but no formal proof that it is correct has been done.
Ideas like being able to modify and resume an erroneous parsing attempt would also be interesting to explore.
Finally, the plan is to actually use this library for something significant at some point in the future.