Simple, efficient, sound and complete combinator parsing for all context-free grammars, using an oracle
The P3 combinator parsing library
(Talk at OCaml 2014)
Dr Tom Ridge
2014-09-05
Main features
Parsing library (also has a parser generator component); interactive development
Can handle all CFGs (including left-recursive CFGs)
Scannerless (parse strings directly without lexing), or can use a lexer
Good theoretical basis, e.g. sound and complete (see here for what these mean formally)
Good performance (for a general parser)
Can handle some fancy examples
Anti-feature: currently nowhere near as fast as deterministic parsers, e.g. ocamlyacc
This talk
- We look at some examples, not at the theory
Example
- Consider the grammar
E -> E E E | "1" | eps
(* Grammar: E -> E E E | "1" | eps *)
let rec parse_E = (fun i -> (mkntparser "E" (
    (parse_E ***> parse_E ***> parse_E)
    |||| (a "1")
    |||| (a "")))
  i)
What does complete mean?
Naively, completeness means: all parse trees are returned
But what if there are infinitely many parse trees? (With the grammar above, E derives the empty string via arbitrarily deep trees, so every input has infinitely many parse trees)
Solution: given a grammar, define a class of "good" parse trees. The class is finite (but potentially exponentially large).
"Good" parse trees are all you need - the details are not in this talk.
Completeness then means: all good parse trees are returned.
Example: parse trees
let rec parse_E = (fun i -> mkntparser "E" (
    ((parse_E ***> parse_E ***> parse_E) >>>> (fun (x,(y,z)) -> `Node(x,y,z)))
    |||| ((a "1") >>>> (fun _ -> `LF("1")))
    |||| ((a "") >>>> (fun _ -> `LF(""))))
  i)
Input size | Number of results (good parse trees)
0          | 1
1          | 1
2          | 3
3          | 19
4          | 150
...
19         | 441152315040444150
See here
Fact: computing results for inputs of size 19 or larger is not feasible
Note: an exponential number of results means this must take (at least) an exponential amount of time
Note: this has nothing to do with parsing; it is due to the actions
Conceptual distinction: parsing is not applying actions
Parsing is different from applying the actions
Parsing can be done in polynomial time (e.g. using an Earley parser); the results can be computed in polynomial time and stored in polynomial space.
Applying actions takes how long? It depends on the actions.
Applying actions over all good parse trees (even where each action takes constant time) can take an exponential amount of time, simply because there are exponentially many good parse trees
Example: counting
(* Grammar: E -> E E E | "1" | eps *)
let rec parse_E = (fun i -> (mkntparser "E" (
    ((parse_E ***> parse_E ***> parse_E) >>>> (fun (x,(y,z)) -> x+y+z))
    |||| ((a "1") >>>> (fun _ -> 1))
    |||| ((a "") >>>> (fun _ -> 0))))
  i)
Input size | Number of results
0          | 1
1          | 1
2          | 1
4          | 1
...
19         | 1
Example: memoized counting (YCDTWAOP)
let rec parse_E =
  let tbl_E = MyHashtbl.create 100 in
  (fun i -> memo_p3 tbl_E (mkntparser "E" (
      ((parse_E ***> parse_E ***> parse_E) >>>> (fun (x,(y,z)) -> x+y+z))
      |||| ((a "1") >>>> (fun _ -> 1))
      |||| ((a "") >>>> (fun _ -> 0))))
    i)
This is polytime - inputs of length 100 (!!!) take a few seconds to process and return a single result
- The library has been designed to ensure "optimal" performance at each stage.
- If you want to write highly ambiguous grammars then parsing is still O(n^3).
- The performance when applying actions can be optimized using memoization
Simple semantics: compute actions over all (good) parse trees; there are exponentially many such parse trees, but this does not have to take exponential time
What this means is that, provided your actions don't cause exponentially many results to be returned, performance is often quite reasonable
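The memoization pattern behind memo_p3 can be sketched independently of the library. The following is a minimal, self-contained illustration; the helper memo is hypothetical and not part of the P3 API:

```ocaml
(* A minimal, self-contained sketch of the memoization pattern used by
   memo_p3; the helper [memo] is hypothetical, not part of the P3 API. *)

(* Wrap a function so that repeated calls with the same argument are
   answered from a hash table instead of being recomputed. *)
let memo (tbl : ('a, 'b) Hashtbl.t) (f : 'a -> 'b) : 'a -> 'b =
  fun x ->
    match Hashtbl.find_opt tbl x with
    | Some v -> v
    | None -> let v = f x in Hashtbl.add tbl x v; v

(* Example: a naively exponential recursion becomes linear once the
   recursive calls go through the table. *)
let tbl = Hashtbl.create 100

let rec fib n =
  memo tbl (fun n -> if n < 2 then n else fib (n - 1) + fib (n - 2)) n
```

As with parse_E above, the table must be created once, outside the function being memoized, so that all recursive calls share it.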
Example: disambiguation
- Arithmetic expressions are pretty basic, but parsing and disambiguating them is somewhat complicated
E -> E "+" E | E "-" E | ...
- One approach: fiddle with the grammar (this uses CFG structure to encode associativity and precedence) link
<Exp> ::= <Exp> + <Term> | <Exp> - <Term> | <Term>
<Term> ::= <Term> * <Factor> | <Term> / <Factor> | <Factor>
<Factor> ::= x | y | ... | ( <Exp> ) | - <Factor>
- Another approach: add associativity and precedence directly to the parser link; the following expresses associativity and precedence (later declarations have higher precedence)
%left PLUS MINUS
%left MULTIPLY DIVIDE
%left NEG /* negation -- unary minus */
%right CARET /* exponentiation */
- Both these approaches can be used by P3 without any special support in the parser e.g. see here
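To see how the first approach works, here is a minimal, self-contained sketch (plain recursive descent, not P3 code) of the stratified grammar above; the token and AST types are invented for illustration, and parentheses and unary minus are omitted:

```ocaml
(* A minimal sketch (plain recursive descent, not P3 code) showing how the
   stratified <Exp>/<Term>/<Factor> grammar encodes precedence and
   associativity. Token and AST types are invented for illustration. *)
type tok = NUM of int | PLUS | TIMES
type exp = Num of int | Plus of exp * exp | Times of exp * exp

(* <Factor> ::= num *)
let parse_factor = function
  | NUM n :: rest -> (Num n, rest)
  | _ -> failwith "expected a number"

(* <Term> ::= <Term> * <Factor> | <Factor>; the left recursion becomes a
   loop, which also makes * left-associative *)
let parse_term toks =
  let (f, rest) = parse_factor toks in
  let rec loop acc = function
    | TIMES :: rest ->
      let (f, rest) = parse_factor rest in loop (Times (acc, f)) rest
    | rest -> (acc, rest)
  in
  loop f rest

(* <Exp> ::= <Exp> + <Term> | <Term> *)
let parse_exp toks =
  let (t, rest) = parse_term toks in
  let rec loop acc = function
    | PLUS :: rest ->
      let (t, rest) = parse_term rest in loop (Plus (acc, t)) rest
    | rest -> (acc, rest)
  in
  loop t rest

(* 1*2+3 parses as (1*2)+3, with no separate precedence declarations *)
let (tree, _) = parse_exp [NUM 1; TIMES; NUM 2; PLUS; NUM 3]
```

The CFG structure alone gives * higher precedence than +, which is exactly what the stratification into <Exp>, <Term> and <Factor> buys you.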
Example: disambiguation (YCDTWAOP)
A general approach: rewrite parse trees (or, less generally, throw away ones you don't want)
The following is for right-associative + and *; we return an option (Some if the parse is acceptable, None if the parse is not acceptable) link
Suppose the input is 1*2+3; this should be parsed as (1*2)+3, not 1*(2+3)
E -> E "+" E {{ fun (x,(_,y)) -> (match x,y with
                 | (Some(Plus _),Some _) -> None (* not (x+y)+z ! *)
                 | (Some x,Some y) -> Some(Plus(x,y))
                 | _ -> None) }}
  | E "*" E {{ fun (x,(_,y)) -> (match x,y with
                 | (Some(Times _),Some _) -> None (* not (x*y)*z ! *)
                 | (Some(Plus _),Some _) -> None (* not (x+y)*z ! *)
                 | (Some x,Some(Plus _)) -> None (* not x*(y+z) ! <-- *)
                 | (Some x,Some y) -> Some(Times(x,y))
                 | _ -> None) }}
  | ?num? {{ fun s -> Some(Num(int_of_string (content s))) }}
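The effect of these actions can be checked with a standalone filter over candidate parse trees; the type exp and the function accept below are invented for illustration, not part of P3:

```ocaml
(* A standalone illustration (not P3 code) of the actions above, expressed
   as a filter over candidate parse trees; [exp] and [accept] are invented
   for illustration. *)
type exp = Num of int | Plus of exp * exp | Times of exp * exp

let rec accept = function
  | Num _ -> true
  | Plus (Plus _, _) -> false   (* not (x+y)+z: + is right-assoc *)
  | Plus (x, y) -> accept x && accept y
  | Times (Times _, _) -> false (* not (x*y)*z: * is right-assoc *)
  | Times (Plus _, _) -> false  (* not (x+y)*z *)
  | Times (_, Plus _) -> false  (* not x*(y+z) *)
  | Times (x, y) -> accept x && accept y

(* For input 1*2+3: the tree (1*2)+3 is kept, 1*(2+3) is rejected *)
let good = accept (Plus (Times (Num 1, Num 2), Num 3))
let bad = accept (Times (Num 1, Plus (Num 2, Num 3)))
```

This is the "rewrite or throw away parse trees" approach in its simplest form: parse everything, then keep only the trees you want.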
Example: modular combination of parsers (YCDTWAOP probably)
Consider grammars that are X (where X is LALR(1) or LR(1) or ...)
In general, combining two X grammars does not result in an X grammar. So you can't modularly specify and combine grammars from such classes without moving to a more general class.
In contrast, two CFGs can be combined to form a CFG. So you can modularly specify and combine such grammars.
Example modular specification and combination of parsers and evaluators for arithmetic, boolean expressions, and lambda calculus here
(* w is (possibly zero length) whitespace *)
(* shadow ***> so the new version also consumes whitespace between the two parsers *)
let ( ***> ) x y = (x ***> w ***> y) >>>> (fun (x,(_,y)) -> (x,y))

let rec parse_A h = (fun i -> mkntparser "arithexp" (
    (((parse_A h) ***> (a "+") ***> (parse_A h))
     >>>> (fun (e1,(_,e2)) -> `Plus(e1,e2)))
    |||| (((parse_A h) ***> (a "-") ***> (parse_A h))
          >>>> (fun (e1,(_,e2)) -> `Minus(e1,e2)))
    |||| (parse_num
          >>>> (fun s -> `Num(int_of_string (content s))))
    |||| (((a "(") ***> (parse_A h) ***> (a ")")) (* brackets *)
          >>>> (fun (_,(e,_)) -> e))
    |||| h) (* helper parser *)
  i)
Note the presence of a helper parser h; this parses "all the other types of expressions"
Define parse_B for booleans, and parse_L for lambda expressions
let rec parse_L h = (fun i -> mkntparser "lambdaexp" (
    (((a "\\") ***> parse_var ***> (parse_L h)) (* lam *)
     >>>> (fun (_,(x,body)) -> `Lam(x,body)))
    |||| (((parse_L h) ***> (parse_L h)) (* app *)
          >>>> (fun (e1,e2) -> `App(e1,e2)))
    |||| (parse_var >>>> (fun s -> `Var s)) (* var *)
    |||| (((a "(") ***> (parse_L h) ***> (a ")")) (* brackets *)
          >>>> (fun (_,(e,_)) -> e))
    |||| h) (* helper parser *)
  i)
let parse_U = (
  let rec l i = parse_L h i
  and a i = parse_A h i
  and b i = parse_B h i
  and h i = (l |||| a |||| b) i
  in
  l)
Arrrrggghhh! What about termination? Relax... the approach is complete, which means that you get all results, which means (inter alia) that the parsers always terminate
Then define an evaluator in the usual way, and finally, combine the parser and the evaluator
let parse_and_eval txt = (remove_err (
  p3_run_parser
    ((parse_memo_U ()) >>>> eval empty_env)
    txt))
let y = "(\\ f ((\\ x (f (x x))) (\\ x (f (x x)))))"
(* sigma; let rec sigma x = if x < 2 then 1 else x+(sigma (x-1)) *)
let sigma = "(\\ g (\\ x (if (x < 2) then 1 else (x+(g (x-1))))))"
(* following is a lambda calc version of the sigma function, applied to argument 5 *)
let txt = "("^y^" "^sigma^") 5"
let [r] = parse_and_eval txt (* expect exactly one result *)
Conclusion
It works and is usable for prototyping and small applications
The back-end Earley parser is highly engineered, and probably of use in other applications
I'm still unhappy with the combinator interface (eta expansion, naming of parsers)
I'm still working to improve real-world performance further
Hopefully the talk has shown you a few things you might not have seen before in any other parser library
Long term aim: general parser, verified correctness and performance, usable in the real-world
https://github.com/tomjridge/p3