1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
|
{% extends "base.html" %}
{% block title %}Packrat Parser Combinator Library{% endblock %}
{% block content %}
<h4>Packrat Parser Combinator Library</h4>
<p>Git repository: <a href="/cgit/cgit.cgi/packrat">Link</a><br>
Paper this was based on: <a href="http://richard.myweb.cs.uwindsor.ca/PUBLICATIONS/PREPRINT_PADL_NOV_07.pdf">Link</a></p>
<h5>2/2/2021</h5>
<h5>Overview</h5>
<p>Parser combinators are what you end up with when you start factoring out common pieces of functionality from
<a href="http://www.engr.mun.ca/~theo/Misc/exp_parsing.htm">recursive descent parsing</a>. They are higher order
functions that can be combined in modular ways to create a desired parser.</p>
<p>However they also inherit the drawbacks of recursive descent parsing, and in particular recursive descent
parsing with backtracking. If the grammar that the parser is designed to accept contains left recursion then
the parser will loop infinitely. If the grammar is ambiguous then only one result will be obtained. And any result
may require exponential time and space to calculate.</p>
<p>This library, based on the paper linked at the top, solves all those problems and a few bits more. As an
example, the following grammar portion:</p>
<div class="precontain">
<pre>
Expression ::= Expression - Term | Term
Term ::= Term * Factor | Factor
Factor ::= ( Expression ) | Integer
</pre>
</div>
<p>Can be turned into the following code snippet:</p>
<div class="precontain">
<code>
package Expr_Redir is new Redirect;
package Term_Redir is new Redirect;
function Left_Parens is new Match ('(');
function Right_Parens is new Match (')');
function Fac_Expr is new Between (Left_Parens, Expr_Redir.Call, Right_Parens);
function Fac_Choice is new Choice_2 (Fac_Expr, Integer_Num);
function Factor is new Stamp (Factor_Label, Fac_Choice);
function Multiply is new Match ('*');
function Term_Mult is new Sequence (Term_Redir.Call'Access, Multiply'Access, Factor'Access);
function Term_Choice is new Choice_2 (Term_Mult, Factor);
function Term is new Stamp (Term_Label, Term_Choice);
function Minus is new Match ('-');
function Expr_Minus is new Sequence (Expr_Redir.Call'Access, Minus'Access, Term'Access);
function Expr_Choice is new Choice_2 (Expr_Minus, Term);
function Expression is new Stamp (Expression_Label, Expr_Choice);
</code>
</div>
<p>Most of the verbosity is caused by the need to individually instantiate each combinator, as generics are
used to serve the same purpose as higher order functions. Some bits are also omitted, such as the label
enumeration and the actual setting of the redirectors. But the above should provide a good general
impression.</p>
<h5>Features</h5>
<p>A list of features that this library provides includes:</p>
<ul>
<li>Higher order combinator functions in Ada, a language that does not support functional programming</li>
<li>Both parser combinators and simpler lexer combinators are available</li>
<li>Input can be any array, whether that is text strings or otherwise</li>
<li>Left recursive grammars are parsed correctly with no infinite loops</li>
<li>Ambiguity is handled by incorporating all possible valid options into the resulting parse tree</li>
<li>Parsing and lexing can both be done piecewise, providing input in successive parts instead of all at once</li>
<li>Error messages are generated when applicable that note what would have been needed and where for a successful parse</li>
<li>All of the above is accomplished in polynomial worst case time and space complexity</li>
</ul>
<p>More thorough documentation is provided in the <em>/doc</em> directory.</p>
<p>The name of the library comes from <a href="https://bford.info/pub/lang/packrat-icfp02.pdf">packrat parsing</a>
which is a parsing algorithm that avoids exponential time complexity by memoizing all intermediate results. As that
is what this library does, both so as to reduce the time complexity and to enable piecewise parsing, the name
seemed appropriate.</p>
<h5>Left Recursion</h5>
<p>Direct left recursion, meaning a grammar non-terminal that immediately recurses to itself on the left as in
the <em>Expression</em> or <em>Term</em> used above, is fairly easy to handle. A counter is used to keep track
of how many times a particular combinator has been applied to the input at a particular position, and when that
counter exceeds the number of unconsumed input tokens plus one the result is curtailed. This is explained on
pages 7 and 8 of the paper.</p>
<p>The really thorny issue that caused the most problems with this library is indirect left recursion. This is
when a non-terminal recurses to itself on the left only after first evaluating to one or more other non-terminals.
Curtailment in these circumstances can easily cause those other non-terminals to also be curtailed, and reusing
the results for those other non-terminals may be incorrect. This issue along with a proposed solution is
explained on page 9 of the paper. However that solution was not as clear as would be preferred. So some minor
rephrasing and reworking was needed.</p>
<p>Bringing this problem back to the start: What are we really doing when we curtail a result due to left
recursion? It is not a matter of cutting off branches in a parse tree. We are identifying conditions where the
parse result of a particular non-terminal can be calculated without further recursion. The word "curtailment" is
somewhat misleading in this regard. Once this reframing is done then a clearer view immediately follows.</p>
<p>What is the condition? Exactly as described above for direct left recursion. Through comparing recursion counts
with the amount of unconsumed input we determine that a result of no successful parse can be calculated, and that
the result is valid for reuse for any deeper recursion of the same combinator at that input position.</p>
<p>From that can be derived:</p>
<ul>
<li>When merging two results that have different left recursion count conditions for the same non-terminal,
the larger count should be used</li>
<li>Conditions of subresults should also be made part of any result that includes those subresults</li>
<li>Any memoized result is only reusable if all the recursion count conditions of the stored result are less
than or equal to the recursion counts for the current input position</li>
</ul>
<p>So far the above list just covers what is in the paper. But there is a little more that can be inferred:</p>
<ul>
<li>If a result is not reusable and a new result is calculated, then the recursion count conditions of the
old result should be updated to the recursion counts at the current position and applied to the new result</li>
<li>When the recursion count of a condition applied to a result plus the number of unconsumed input tokens after
the result is less than the number of input tokens available at the beginning of the result, then that
condition can be omitted from the result</li>
</ul>
<p>These two details should constitute a minor efficiency improvement.</p>
<h5>Further Work</h5>
<p>While the polynomial complexity of this library has been experimentally confirmed, no effort has yet been
made to prove that it is actually polynomial in the same way that the parser combinators in the paper are. It
is possible that due to the changes involved with using a non-functional language and enabling piecewise
parsing that some subtle complexity difference may have arisen.</p>
<p>Likewise, the piecewise parsing has been unit tested to a degree but no formal proof that it is correct has
been done.</p>
<p>Ideas like being able to modify and resume an erroneous parsing attempt would also be interesting to explore.</p>
<p>Finally, the plan is to actually use this library for something significant at some point in the future.</p>
{% endblock %}
|