Packrat parser combinator article added

author: Jed Barber <jjbarber@y7mail.com> 2021-02-02 17:25:27 +1100
committer: Jed Barber <jjbarber@y7mail.com> 2021-02-02 17:25:27 +1100
commit: 5265cb93699114e05fcea667597ea96cf3f4b2f4 (patch)
tree: 27cb03eae555b325e2b5f31c48001f5574225432
parent: 7337c888db6cb9adf2e26cbd85724ef9f760bbe0 (diff)
3 files changed, 161 insertions, 0 deletions
diff --git a/project/complexity.yml b/project/complexity.yml
index 8f680f4..6537139 100644
--- a/project/complexity.yml
+++ b/project/complexity.yml
@@ -21,4 +21,5 @@ unexpanded_templates:
  - "links.html"
  - "steelman.html"
  - "integral.html"
+ - "packrat.html"
 
diff --git a/project/templates/index.html b/project/templates/index.html
index dfe396b..3a66532 100644
--- a/project/templates/index.html
+++ b/project/templates/index.html
@@ -16,6 +16,9 @@
 {% block content %}
 
 <ul class="index">
+    <li><a href="/packrat.html">Packrat Parser Combinator Library</a><br>
+        <span class="post">(Posted 2/2/2021)</span></li>
+
     <li><a href="/integral.html">Area Under the Curve of a Complex Integral</a><br>
         <span class="post">(Posted 29/12/2018)</span></li>
 
diff --git a/project/templates/packrat.html b/project/templates/packrat.html
new file mode 100644
index 0000000..117c69a
--- /dev/null
+++ b/project/templates/packrat.html
@@ -0,0 +1,157 @@
+
+{% extends "base.html" %}
+
+
+
+{% block title %}Packrat Parser Combinator Library{% endblock %}
+
+
+
+{% block content %}
+
+
+<h4>Packrat Parser Combinator Library</h4>
+
+<p>Git repository: <a href="/cgit/cgit.cgi/packrat">Link</a><br>
+Paper this was based on: <a href="http://richard.myweb.cs.uwindsor.ca/PUBLICATIONS/PREPRINT_PADL_NOV_07.pdf">Link</a></p>
+
+<h5>2/2/2021</h5>
+
+
+<h5>Overview</h5>
+
+<p>Parser combinators are what you end up with when you start factoring out common pieces of functionality from
+<a href="http://www.engr.mun.ca/~theo/Misc/exp_parsing.htm">recursive descent parsing</a>. They are higher order
+functions that can be combined in modular ways to create a desired parser.</p>
+
+<p>However they also inherit the drawbacks of recursive descent parsing, and in particular recursive descent
+parsing with backtracking. If the grammar that the parser is designed to accept contains left recursion then
+the parser will loop infinitely. If the grammar is ambiguous then only one result will be obtained. And any result
+may require exponential time and space to calculate.</p>
+
+<p>This library, based on the paper linked at the top, solves all those problems and a few bits more. As an
+example, the following grammar portion:</p>
+
+<div class="precontain">
+<pre>
+Expression ::= Expression - Term | Term
+Term ::= Term * Factor | Factor
+Factor ::= ( Expression ) | Integer
+</pre>
+</div>
+
+<p>Can be turned into the following code snippet:</p>
+
+<div class="precontain">
+<code>
+package Expr_Redir is new Redirect;
+package Term_Redir is new Redirect;
+
+function Left_Parens is new Match ('(');
+function Right_Parens is new Match (')');
+function Fac_Expr is new Between (Left_Parens, Expr_Redir.Call, Right_Parens);
+function Fac_Choice is new Choice_2 (Fac_Expr, Integer_Num);
+function Factor is new Stamp (Factor_Label, Fac_Choice);
+
+function Multiply is new Match ('*');
+function Term_Mult is new Sequence (Term_Redir.Call'Access, Multiply'Access, Factor'Access);
+function Term_Choice is new Choice_2 (Term_Mult, Factor);
+function Term is new Stamp (Term_Label, Term_Choice);
+
+function Minus is new Match ('-');
+function Expr_Minus is new Sequence (Expr_Redir.Call'Access, Minus'Access, Term'Access);
+function Expr_Choice is new Choice_2 (Expr_Minus, Term);
+function Expression is new Stamp (Expression_Label, Expr_Choice);
+</code>
+</div>
+
+<p>Most of the verbosity is caused by the need to individually instantiate each combinator, as generics are
+used to serve the same purpose as higher order functions. Some bits are also omitted, such as the label
+enumeration and the actual setting of the redirectors. But the above should provide a good general
+impression.</p>
+
+
+<h5>Features</h5>
+
+<p>A list of features that this library provides includes:</p>
+<ul>
+  <li>Higher order combinator functions in Ada, a language that does not support functional programming</li>
+  <li>Both parser combinators and simpler lexer combinators are available</li>
+  <li>Input can be any array, whether that is text strings or otherwise</li>
+  <li>Left recursive grammars are parsed correctly with no infinite loops</li>
+  <li>Ambiguity is handled by incorporating all possible valid options into the resulting parse tree</li>
+  <li>Parsing and lexing can both be done piecewise, providing input in successive parts instead of all at once</li>
+  <li>Error messages are generated when applicable that note what would have been needed and where for a successful parse</li>
+  <li>All of the above is accomplished in polynomial worst case time and space complexity</li>
+</ul>
+
+<p>More thorough documentation is provided in the <em>/doc</em> directory.</p>
+
+<p>The name of the library comes from <a href="https://bford.info/pub/lang/packrat-icfp02.pdf">packrat parsing</a>
+which is a parsing algorithm that avoids exponential time complexity by memoizing all intermediate results. As that
+is what this library does, both so as to reduce the time complexity and to enable piecewise parsing, the name
+seemed appropriate.</p>
+
+
+<h5>Left Recursion</h5>
+
+<p>Direct left recursion, meaning a grammar non-terminal that immediately recurses to itself on the left as in
+the <em>Expression</em> or <em>Term</em> used above, is fairly easy to handle. A counter is used to keep track
+of how many times a particular combinator has been applied to the input at a particular position, and when that
+counter exceeds the number of unconsumed input tokens plus one the result is curtailed. This is explained on
+pages 7 and 8 of the paper.</p>
+
+<p>The really thorny issue that caused the most problems with this library is indirect left recursion. This is
+when a non-terminal recurses to itself on the left only after first evaluating to one or more other non-terminals.
+Curtailment in these circumstances can easily cause those other non-terminals to also be curtailed, and reusing
+the results for those other non-terminals may be incorrect. This issue along with a proposed solution is
+explained on page 9 of the paper. However that solution was not as clear as would be preferred. So some minor
+rephrasing and reworking was needed.</p>
+
+<p>Bringing this problem back to the start: What are we really doing when we curtail a result due to left
+recursion? It is not a matter of cutting off branches in a parse tree. We are identifying conditions where the
+parse result of a particular non-terminal can be calculated without further recursion. The word "curtailment" is
+somewhat misleading in this regard. Once this reframing is done then a clearer view immediately follows.</p>
+
+<p>What is the condition? Exactly as described above for direct left recursion. Through comparing recursion counts
+with the amount of unconsumed input we determine that a result of no successful parse can be calculated, and that
+the result is valid for reuse for any deeper recursion of the same combinator at that input position.</p>
+
+<p>From that can be derived:</p>
+<ul>
+  <li>When merging two results that have different left recursion count conditions for the same non-terminal,
+  the larger count should be used</li>
+  <li>Conditions of subresults should also be made part of any result that includes those subresults</li>
+  <li>Any memoized result is only reusable if all the recursion count conditions of the stored result are less
+  than or equal to the recursion counts for the current input position</li>
+</ul>
+
+<p>So far the above list just covers what is in the paper. But there is a little more that can be inferred:</p>
+<ul>
+  <li>If a result is not reusable and a new result is calculated, then the recursion count conditions of the
+  old result should be updated to the recursion counts at the current position and applied to the new result</li>
+  <li>When the recursion count of a condition applied to a result plus the number of unconsumed input tokens after
+  the result is less than the number of input tokens available at the beginning of the result, then that
+  condition can be omitted from the result</li>
+</ul>
+
+<p>These two details should constitute a minor efficiency improvement.</p>
+
+
+<h5>Further Work</h5>
+
+<p>While the polynomial complexity of this library has been experimentally confirmed, no effort has yet been
+made to prove that it is actually polynomial in the same way that the parser combinators in the paper are. It
+is possible that due to the changes involved with using a non-functional language and enabling piecewise
+parsing that some subtle complexity difference may have arisen.</p>
+
+<p>Likewise, the piecewise parsing has been unit tested to a degree but no formal proof that it is correct has
+been done.</p>
+
+<p>Ideas like being able to modify and resume an erroneous parsing attempt would also be interesting to explore.</p>
+
+<p>Finally, the plan is to actually use this library for something significant at some point in the future.</p>
+
+
+{% endblock %}
+
author	Jed Barber <jjbarber@y7mail.com>	2021-02-02 17:25:27 +1100
committer	Jed Barber <jjbarber@y7mail.com>	2021-02-02 17:25:27 +1100
commit	5265cb93699114e05fcea667597ea96cf3f4b2f4 (patch)
tree	27cb03eae555b325e2b5f31c48001f5574225432
parent	7337c888db6cb9adf2e26cbd85724ef9f760bbe0 (diff)