1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
|
{%- extends "base_plain.xhtml" -%}
{%- block title -%}Packrat Parser Combinator Library{%- endblock -%}
{%- block footer -%}{{ plain_footer ("packrat.xhtml") }}{%- endblock -%}
{%- block content %}
<h4>Packrat Parser Combinator Library</h4>
<p>Git repository: <a href="/cgi-bin/cgit.cgi/packrat">Link</a><br />
Paper this was based on: <a href="http://richard.myweb.cs.uwindsor.ca/PUBLICATIONS/PREPRINT_PADL_NOV_07.pdf"
class="external">Link</a></p>
<h5>2/2/2021</h5>
<h5>Overview</h5>
<p>Parser combinators are what you end up with when you start factoring out common pieces of
functionality from <a href="http://www.engr.mun.ca/~theo/Misc/exp_parsing.htm" class="external">
recursive descent parsing</a>. They are higher order functions that can be combined in modular ways
to create a desired parser.</p>
<p>However they also inherit the drawbacks of recursive descent parsing, and in particular recursive
descent parsing with backtracking. If the grammar that the parser is designed to accept contains
left recursion then the parser will loop infinitely. If the grammar is ambiguous then only one
result will be obtained. And any result may require exponential time and space to calculate.</p>
<p>This library, based on the paper linked at the top, solves all those problems and a few bits
more. As an example, the following grammar portion:</p>
<div class="precontain">
<pre>
Expression ::= Expression - Term | Term
Term ::= Term * Factor | Factor
Factor ::= ( Expression ) | Integer
</pre>
</div>
<p>Can be turned into the following code snippet:</p>
<div class="precontain">
<code>
package Expr_Redir is new Redirect;
package Term_Redir is new Redirect;
function Left_Parens is new Match ('(');
function Right_Parens is new Match (')');
function Fac_Expr is new Between (Left_Parens, Expr_Redir.Call, Right_Parens);
function Fac_Choice is new Choice_2 (Fac_Expr, Integer_Num);
function Factor is new Stamp (Factor_Label, Fac_Choice);
function Multiply is new Match ('*');
function Term_Mult is new Sequence (Term_Redir.Call'Access, Multiply'Access, Factor'Access);
function Term_Choice is new Choice_2 (Term_Mult, Factor);
function Term is new Stamp (Term_Label, Term_Choice);
function Minus is new Match ('-');
function Expr_Minus is new Sequence (Expr_Redir.Call'Access, Minus'Access, Term'Access);
function Expr_Choice is new Choice_2 (Expr_Minus, Term);
function Expression is new Stamp (Expression_Label, Expr_Choice);
</code>
</div>
<p>Most of the verbosity is caused by the need to individually instantiate each combinator, as
generics are used to serve the same purpose as higher order functions. Some bits are also omitted,
such as the label enumeration and the actual setting of the redirectors. But the above should
provide a good general impression.</p>
<h5>Features</h5>
<p>A list of features that this library provides includes:</p>
<ul>
<li>Higher order combinator functions in Ada, a language that does not support functional
programming</li>
<li>Both parser combinators and simpler lexer combinators are available</li>
<li>Input can be any array, whether that is text strings or otherwise</li>
<li>Left recursive grammars are parsed correctly with no infinite loops</li>
<li>Ambiguity is handled by incorporating all possible valid options into the resulting parse tree
</li>
<li>Parsing and lexing can both be done piecewise, providing input in successive parts instead of
all at once</li>
<li>Error messages are generated when applicable that note what would have been needed and where
for a successful parse</li>
<li>All of the above is accomplished in polynomial worst case time and space complexity</li>
</ul>
<p>More thorough documentation is provided in the <em>/doc</em> directory.</p>
<p>The name of the library comes from <a href="https://bford.info/pub/lang/packrat-icfp02.pdf"
class="external">packrat parsing</a> which is a parsing algorithm that avoids exponential time
complexity by memoizing all intermediate results. As that is what this library does, both so as to
reduce the time complexity and to enable piecewise parsing, the name seemed appropriate.</p>
<h5>Left Recursion</h5>
<p>Direct left recursion, meaning a grammar non-terminal that immediately recurses to itself on the
left as in the <em>Expression</em> or <em>Term</em> used above, is fairly easy to handle. A counter
is used to keep track of how many times a particular combinator has been applied to the input at a
particular position, and when that counter exceeds the number of unconsumed input tokens plus one
the result is curtailed. This is explained on pages 7 and 8 of the paper.</p>
<p>The really thorny issue that caused the most problems with this library is indirect left
recursion. This is when a non-terminal recurses to itself on the left only after first evaluating to
one or more other non-terminals. Curtailment in these circumstances can easily cause those other
non-terminals to also be curtailed, and reusing the results for those other non-terminals may be
incorrect. This issue along with a proposed solution is explained on page 9 of the paper. However
that solution was not as clear as would be preferred. So some minor rephrasing and reworking was
needed.</p>
<p>Bringing this problem back to the start: What are we really doing when we curtail a result due to
left recursion? It is not a matter of cutting off branches in a parse tree. We are identifying
conditions where the parse result of a particular non-terminal can be calculated without further
recursion. The word "curtailment" is somewhat misleading in this regard. Once this reframing is done
then a clearer view immediately follows.</p>
<p>What is the condition? Exactly as described above for direct left recursion. Through comparing
recursion counts with the amount of unconsumed input we determine that a result of no successful
parse can be calculated, and that the result is valid for reuse for any deeper recursion of the same
combinator at that input position.</p>
<p>From that can be derived:</p>
<ul>
<li>When merging two results that have different left recursion count conditions for the same
non-terminal, the larger count should be used</li>
<li>Conditions of subresults should also be made part of any result that includes those subresults
</li>
<li>Any memoized result is only reusable if all the recursion count conditions of the stored
result are less than or equal to the recursion counts for the current input position</li>
</ul>
<p>So far the above list just covers what is in the paper. But there is a little more that can be
inferred:</p>
<ul>
<li>If a result is not reusable and a new result is calculated, then the recursion count
conditions of the old result should be updated to the recursion counts at the current position and
applied to the new result</li>
<li>When the recursion count of a condition applied to a result plus the number of unconsumed
input tokens after the result is less than the number of input tokens available at the beginning
of the result, then that condition can be omitted from the result</li>
</ul>
<p>These two details should constitute a minor efficiency improvement.</p>
<h5>Further Work</h5>
<p>While the polynomial complexity of this library has been experimentally confirmed, no effort has
yet been made to prove that it is actually polynomial in the same way that the parser combinators in
the paper are. It is possible that due to the changes involved with using a non-functional language
and enabling piecewise parsing that some subtle complexity difference may have arisen.</p>
<p>Likewise, the piecewise parsing has been unit tested to a degree but no formal proof that it is
correct has been done.</p>
<p>Ideas like being able to modify and resume an erroneous parsing attempt would also be interesting
to explore.</p>
<p>Finally, the plan is to actually use this library for something significant at some point in the
future.</p>
{% endblock -%}
|