From 4b1face73de556041fea86c50e35e0ffaa3e8c74 Mon Sep 17 00:00:00 2001 From: Jed Barber Date: Wed, 22 Apr 2020 00:29:31 +1000 Subject: Revamped notes, formulated generic directed graph API --- packrat_parser_lib_notes.txt | 513 +++++++++++++++++++++---------------------- 1 file changed, 252 insertions(+), 261 deletions(-) diff --git a/packrat_parser_lib_notes.txt b/packrat_parser_lib_notes.txt index 2a8e0cc..67781cd 100644 --- a/packrat_parser_lib_notes.txt +++ b/packrat_parser_lib_notes.txt @@ -2,20 +2,20 @@ package structure: +Ada.Containers.Directed_Graphs + Packrat -Packrat.Parser (generic over memoize enum, input item type, array of input items, graph of output item subarrays) -Packrat.Parser.Combinators (merged into Packrat.Parser) -Packrat.Lexer (generic over stamp enum, input item type, array of input items, array of output items wrapped as tokens) -Packrat.Lexer.Combinators (merged into Packrat.Lexer) -Packrat.Util Packrat.Errors (nested) -Packrat.Interfaces (nested) -Packrat.Graphs (generic over leaf array type) Packrat.Tokens (nested, generic over contained array) +Packrat.Util +Packrat.Parse_Graphs (generic over memoize enum, tokens) +Packrat.Lexer (generic over stamp enum, input item type, array of input items, array of output items wrapped as tokens) +Packrat.Parser (generic over memoize enum, input item type, array of input items, graph of output item subarrays) Packrat.Text -Packrat.Text.Std (nested, generic over parser/lexer label enums) -Packrat.Text.Wide (nested, generic over parser/lexer label enums) -Packrat.Text.Wide_Wide (nested, generic over parser/lexer label enums) +Packrat.Text.No_Lex (generic over parser label enum) +Packrat.Text.Standard (generic over parser/lexer label enums) +Packrat.Text.Wide (generic over parser/lexer label enums) +Packrat.Text.Wide_Wide (generic over parser/lexer label enums) Ratnest.Tests Ratnest.Examples @@ -32,13 +32,16 @@ planned order of writing: (Ratnest not mentioned since that's where all the testing 
functions will go) Packrat.Util +Packrat Packrat.Errors Packrat.Tokens Packrat.Lexer -Packrat.Graphs +Ada.Containers.Directed_Graphs +Packrat.Parse_Graphs Packrat.Parser Packrat.Text -Packrat.Text.Std +Packrat.Text.No_Lex +Packrat.Text.Standard Packrat.Text.Wide Packrat.Text.Wide_Wide Calculator @@ -52,18 +55,245 @@ Off_Side Packrat - main package - - defines parser_component_function, lexer_component_function, syntax_tree, token_list, parser_context types - - parser_context type is completely opaque and has an empty_context constant - - syntax_tree is a tagged type container - - token_list is just an array (so can be any sort of string, or an array of tokens produced by a lexer, etc) - - is it possible to define parser/lexer components such that an instantiated main parsing/lexing function qualifies? - - probably not, since the main parsing/lexing functions will be able to throw exceptions for parsing/lexing errors, - whereas the component functions will only pass a failure up the chain as part of the context - - possible usage modes include (token_list -> lexer -> token_list -> parser -> syntax_tree) and (token_list -> - parser -> syntax_tree) + - defines exceptions, some common enumeration types that are used throughout + - contains Errors and Tokens as nested packages + - has some protected utility functions to convert between String and Unbounded_String types + - intended usage mode is (token_array -> lexer -> token_array -> parser -> parse_graph) although + it is also possible for (token_array -> parser -> parse_graph) + +List of types: + Result_Status + Parser_Error + Lexer_Error + + + + +Packrat.Errors + - nested package, as this functionality is important to parsers/lexers + - functions to handle and process exception messages + - exception messages take the form of one or more "sp" substrings joined together, + with each symbol name being in all capitals and each position being a positive integer + - this message represents the list of expected symbols at 
particular positions that would have resulted in a + more complete parse + - one of these messages will be raised with an exception if no valid parse is possible with a given input + +List of datatypes: + Error_Message (the errors encoded into a String) + Error_Info (containing a string of the symbol name expected, and a natural of the position) + Error_Info_Array List of funcs: -(primitive operations for syntax_tree) + Valid_Message + Valid_Identifier + Valid_Identifier_Array + Debug_String + +Join + Encode + Encode_Array + Decode +(the message format ensures that using "&" to join messages will still result in a valid message) + + + + +Packrat.Tokens + - nested package, defines a datatype important throughout lexing/parsing + - generic over the array type of whatever is being lexed/parsed and the enum of valid token labels + - contains an enum identifier, the start position, and the token value + - cannot contain the finish position of a token because that would make it impossible + to avoid exponential blowup in parse graphs for ambiguous grammars + +List of datatypes: + Token + +List of funcs: + Create + Debug_String + + Label + Value + Start + + + + +Packrat.Util + - contains predicates to use in Satisfy parser components and similar + +List of funcs: + In_Set + Not_In_Set + Is_Digit + Is_Hex + Is_Letter + Is_Alphanumeric + Is_Punctuation + Is_ASCII + Is_Extended_ASCII + Is_Space + Is_Linespace + Is_End_Of_Line + Is_Whitespace + Not_Whitespace + + + + +Packrat.Lexer + - generic over the enum used for labeling lexemes, the input element list, and an instantiated Packrat.Tokens + - contains all datatypes and functions for lexing + +List of types: + Combinator + Combinator_Result + Combinator_Array + Lexer_Context + Component + Component_Result + Component_Array + With_Input + +List of funcs: +(Component functions) + Stamp (will store a token for the input recognised by its combinators) + Ignore (won't store a token) + +(Lexing functions) + Scan (can be fed input 
in chunks, will keep continuing until fed a zero-length array) + Scan_Only (will assume that the input given is all there is, without considering the possibility of more input) + Scan_With (takes a function that supplies input when called, processes input the same as Scan) + Scan_Set (fed input in constant size chunks with a specified padding character) + Scan_Set_With (a combination of the functionality of Scan_Set and Scan_With) + +(Combinator functions) + Sequence + Count + Many + Many_Until + + Satisfy + Satisfy_With + Match + Match_With + Multimatch + Take + Take_While + Take_Until + + Line_End + Input_End + + + + +Ada.Containers.Directed_Graphs + - generic over discrete node type, array of node type, node label type, edge label type + - implemented using the adjacency list method + - allows both nodes and edges to be labeled, one arbitrary type for each + - nodes and edges don't *have* to be labeled + - API in the style of the Ada Containers library + - inspiration also taken from Haskell Data.Graph.Inductive.Graph + +List of types: + Extended_Node_Type + Edge_Type + Edge_Type_Array + Path (derived from the node array type) + Graph + Cursor + Node_Label_Reference + Constant_Node_Label_Reference + Edge_Label_Reference + Constant_Edge_Label_Reference + +Constants: + No_Node (of Extended_Node_Type) + No_Element (of Cursor) + Empty_Graph (of Graph) + +List of Cursor funcs: + Has_Element + Element (cursor -> node) + + Equal_Subgraph + Subgraph_Node_Count + + First_Parent + Next_Parent + Previous_Parent + Last_Parent + First_Child + Next_Child + Previous_Child + Last_Child + +List of Graph funcs: + "=" + + Assign + Copy + Move + + Is_Empty + Clear + To_Cursor (node -> cursor) + + Node_Count (the number of nodes in the graph) + Edge_Count (the number of edges in the graph) + + Nodes (array of all nodes in the graph) + Edges (array of all edges in the graph) + Node_Range (returns minimum and maximum of nodes in the graph) + Unused_Nodes (returns an
array of a specified number of nodes not used in the graph) + + Insert (add a node at a given position (with label), add an edge (with label), add multiple nodes, multiple edges) + Append (add a node at the next available position (with label), return the node added) + Delete (remove a node, an edge, multiple nodes, multiple edges, from the graph) + Append_Label (give a node or an edge a label) + Replace_Label + Delete_Label (remove a label from a node or an edge) + Delete_Subgraph + Swap (switches two nodes in the graph) + + Context (returns all parents of a node, all children of the node) + Labeled_Context (returns all parents of a node, the label of the node, and all the children of the node) + Has_Label (predicates for testing if a node or edge has a label in the graph) + Label (returns the label of a node or an edge if it has one) + Label_Reference (accessor to a label) + Constant_Label_Reference (constant accessor) + Neighbors (two nodes are neighbors if they have edges in both directions) + Parents (nodes with edges going into a node) + Children (nodes with edges coming from a node) + Outbound (edges going out of a node) + Inbound (edges coming into a node) + Outdegree (outbound degree of a node) + Indegree (inbound degree of a node) + Degree (degree of a node; how many other nodes are connected) + Has_Edge (check whether two nodes are connected) + Has_Labeled_Edge (check whether two nodes are connected with a labeled edge) + Has_Neighbor (check whether two nodes are neighbors) + + Find (find a node or edge with a particular label) + Find_In_Subgraph + Contains (check whether a node or an edge is in the graph) + Contains_Label (check if graph contains a particular node/edge label) + + Iterate + Iterate_Subgraph + Iterate_Children + Iterate_Parents + + + + +Packrat.Parse_Graphs + - generic over an instantiated Packrat.Tokens and associated types + - sets up the above directed graph lib for use as a parse graph that will avoid exponential blowup + - provides 
extra functions to merge graphs and iterate specific parsing choices + + @@ -152,245 +382,6 @@ End_Of_Input -Packrat.Lexer - - generic over the enum used for labeling lexemes and the input token_list as well as component token elements - - should be possible to place an upper limit on the number of tokens scanned, so as to accommodate a statically - sized output array of tokens (and possibly a statically sized input array) - -List of datatypes: -Combinator -Combinator_Result - -List of funcs: -(for Combinator_Results) -Create_Result -Join -Status - -(each of these is generic over an array of lexer_component functions, either Stamp or Ignore as below) -Scan - - function that returns an array of lexed tokens - - uses first applicable lexer component to lex each token - - if all lexer components return "partial" then also returns with a partial status - - if all lexer components fail then raises a lexer_error - - lexer status can be fed back into any of these Scan functions to resume with further input - - if end of input is reached without any lexer components returning "partial" then returns - a complete status and feeding the lexer status back into these functions will be the same - as starting with a blank status -Scan_Set - - as above, except is a procedure that uses a fixed size array as output with a padding token -Scan_Only - - function that returns an array of lexed tokens - - takes a lexer status as input to resuem a lex, but will treat it as a constant unlike the others - - if all lexer components return "partial" or fail then raises a lexer_error -Scan_With - - function that returns an array of lexed tokens - - when it runs out of input it uses the supplied function to get more input until that function - returns an empty array - - may also return with a partial status as with Scan -Scan_Set_With - - as above, except is a procedure that uses a fixed size array as output with a padding token - -(determined to be redundant) -Scan_Set_Only - -(type signature 
of these are: - input of an opaque lex_component_input type - output of an opaque lex_component_output type - return of a Fail/Partial/Success result enum) -Stamp - - one of two available lexer component functions - - generic over a value of the enum used for labelling lexemes and a lexer combinator -Ignore - - the other available lexer component function - - generic over a lexer combinator, as this function will scan for the given token and then ignore it - - - - -Packrat.Lexer.Combinators - -type signature of these are: - inputs of input_array and a starting position index - outputs of the number of elements consumed and the value lexed - return of a Fail/Partial/Success result enum - -List of funcs: -Sequence -Count -Many -Many_Until - -Satisfy -Satisfy_With -Match -Match_With -Multimatch -Take -Take_While -Take_Until - -Line_End -Input_End - - - - -Packrat.Util - - contains predicates to use in Satisfy parser components and similar - - has a few string output functions for syntax_tree objects - -List of funcs: -In_Set -Not_In_Set -Is_Digit -Is_Hex -Is_Letter -Is_Alphanumeric -Is_Punctuation -Is_ASCII -Is_Extended_ASCII -Is_Space -Is_Linespace -Is_End_Of_Line -Is_Whitespace -Not_Whitespace - - - - -Packrat.Error -(actually a nested package, as this functionality is important to parsers/lexers) - - functions to handle and process exception messages - - exception messages take the form of one or more "sp" substrings joined together, - with each symbol name being in all capitals and each position being a positive integer - - this message represents the list of expected symbols at particular positions that would have resulted in a - more complete parse - - one of these messages will be raised with an exception of no valid parse is possible with a given input - -List of datatypes: -Error_Message -Error_Info (containing a string of the symbol name expected, and a natural of the position) -Error_Info_Array - -List of funcs: -Valid_Message -Valid_Identifier 
-Valid_Identifier_Array -Debug_String - -Join -Encode -Encode_Array -Decode -(the message format ensures that using "&" to join messages will still result in a valid message) - - - - -Packrat.Tokens - - nested package, defines a datatype important throughout lexing/parsing - - generic over the array type of whatever is being lexed/parsed and the enum of valid token labels - - contains an enum identifier, the start position, the finish position plus one, and the token value - -List of datatypes: -Token (tagged, controlled, but not limited) - -List of funcs: -Create -Initialized -Debug_String - -Label -Value -Start -Finish - - - - -Packrat.Interfaces - - nested package, defines interfaces and types that should be used to derive an abstract syntax graph - - some of these may be removed from Interfaces and just implemented in Graphs depending on what actually - ends up being necessary for parsing - -List of interfaces: -Node - - Leaf (constructor) - - Branch (constructor) - - Is_Leaf - - Is_Branch - - Label - - Elements - - Start - - Finish - -Cursor - - Is_Nothing - - Depth - - Is_Node - - Is_Root - - Is_Branch - - Is_Leaf - - Label - - Elements - - Start - - Finish - - Choices - - Parent - - Child_Count - - All_Child_Count - - First_Child - - Last_Child - - Next_Sibling - - Prev_Sibling - - Delete_Children - - Delete_All_Children - - Equal_Subgraph - - Subgraph_Node_Count - - Find_In_Subgraph - -Directed_Acyclic_Graph - - Contains - - Singleton (constructor) - - Is_Empty - - Is_Ambiguous - - Node_Count - - Root_Count - - Root - - Append - - Prepend - - Attach_Choice - - Clear - - Delete_Position - - Find - - - - -Packrat.Graphs - - builds on the interfaces in Packrat.Interfaces - - can be replaced with a user-supplied type, assuming proper care is taken to avoid exponential issues - -List_of_datatypes: -Node (implements all of Interfaces.Node) -Node_Reference -Cursor (implements all of Interfaces.Cursor) -Iter_Cursor (wraps a Cursor, necessary for iteration) 
-Parse_Graph (implements all of Interfaces.Directed_Acyclic_Graph) -Choosing_Function (used in Iterate_Choice) - -List of funcs: -Debug_String - -Node_At - -Is_Valid_Node -Iterate -Iterate_Subtree -Iterate_Choice - - -- cgit
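The Packrat.Errors message format described in the revised notes (symbol names in all capitals, positions as positive integers, with "&" concatenation of valid messages yielding a valid message) can be sketched as follows. This is a minimal Python illustration, not the Ada API: the exact unit layout (capitalized name immediately followed by its decimal position) and all function names here are assumptions for illustration only.

```python
import re

# Hypothetical rendering of the "sp" unit format: a symbol name in all
# capitals followed by a positive integer position. Because names are
# letters/underscores and positions are digits, concatenating two valid
# messages (Ada's "&") still yields an unambiguously decodable message.
UNIT = re.compile(r"([A-Z_]+)([1-9][0-9]*)")

def valid_message(msg: str) -> bool:
    # A message is valid when it is entirely covered by name+position units.
    return bool(msg) and "".join(m.group(0) for m in UNIT.finditer(msg)) == msg

def encode(symbol: str, position: int) -> str:
    assert symbol.isupper() and position > 0
    return f"{symbol}{position}"

def decode(msg: str) -> list[tuple[str, int]]:
    return [(m.group(1), int(m.group(2))) for m in UNIT.finditer(msg)]

joined = encode("DIGIT", 4) + encode("EXPR", 7)  # plain concatenation, as with "&"
print(valid_message(joined))  # True
print(decode(joined))         # [('DIGIT', 4), ('EXPR', 7)]
```

The self-delimiting letters-then-digits shape is what makes the "join by concatenation" property in the notes hold without any separator character.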
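The adjacency-list representation with optional node and edge labels planned for Ada.Containers.Directed_Graphs can be sketched like this. Again a Python illustration only, not the Ada API; the class and method names mirror a few of the Graph funcs listed in the notes (Insert, Parents, Children, Outdegree, Indegree, Neighbors) but are otherwise invented for the example.

```python
class DirectedGraph:
    """Toy adjacency-list digraph; node and edge labels are optional."""

    def __init__(self):
        self.succ = {}          # node -> set of child nodes (adjacency lists)
        self.pred = {}          # node -> set of parent nodes
        self.node_labels = {}   # only labeled nodes appear here
        self.edge_labels = {}   # only labeled edges ((from, to) keys) appear here

    def insert(self, node, label=None):
        self.succ.setdefault(node, set())
        self.pred.setdefault(node, set())
        if label is not None:
            self.node_labels[node] = label

    def insert_edge(self, frm, to, label=None):
        self.insert(frm)
        self.insert(to)
        self.succ[frm].add(to)
        self.pred[to].add(frm)
        if label is not None:
            self.edge_labels[(frm, to)] = label

    # Counterparts of a few Graph funcs from the notes:
    def children(self, node):   # nodes with edges coming from `node`
        return sorted(self.succ[node])

    def parents(self, node):    # nodes with edges going into `node`
        return sorted(self.pred[node])

    def outdegree(self, node):
        return len(self.succ[node])

    def indegree(self, node):
        return len(self.pred[node])

    def neighbors(self, node):
        # per the notes: neighbors only when edges exist in both directions
        return sorted(self.succ[node] & self.pred[node])

g = DirectedGraph()
g.insert_edge(1, 2, label="a")
g.insert_edge(2, 1)
g.insert_edge(1, 3)
print(g.children(1))   # [2, 3]
print(g.neighbors(1))  # [2]
```

Keeping both successor and predecessor sets doubles the bookkeeping on insert/delete but makes Parents/Inbound queries O(1) lookups rather than full-graph scans, which matters for the parse-graph traversals Packrat.Parse_Graphs layers on top.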