Simple, efficient, sound and complete combinator parsing for all context-free grammars, using an oracle
The P3 combinator parsing library
(Talk at OCaml 2014)
Dr Tom Ridge
2014-09-05
Main features
Parsing library (also has a parser generator component); interactive development
Can handle all CFGs (including left-recursive CFGs)
Scannerless (parse strings directly without lexing), or can use a lexer
Good theoretical basis, e.g. sound and complete (see here for what these mean formally)
Good performance (for a general parser)
Can handle some fancy examples
Anti-feature: currently nowhere near as fast as deterministic parsers, e.g. ocamlyacc
This talk
- We look at some examples, not at the theory
Example
- Consider the grammar
E -> E E E | "1" | eps
(* Grammar: E -> E E E | "1" | eps *)
let rec parse_E = (fun i -> (mkntparser "E" (
    (parse_E ***> parse_E ***> parse_E)
    |||| (a "1")
    |||| (a "")))
  i)
What does complete mean?
Naively, completeness means: all parse trees are returned
But what if there are infinitely many parse trees? (With the grammar above, E derives the empty string via arbitrarily deep trees, so every input has infinitely many parse trees)
Solution: given a grammar, define a class of "good" parse trees. The class is finite (but potentially exponentially large).
"Good" parse trees are all you need - the details are not in this talk.
Completeness then means: all good parse trees are returned.
Example: parse trees
let rec parse_E = (fun i -> mkntparser "E" (
    ((parse_E ***> parse_E ***> parse_E) >>>> (fun (x,(y,z)) -> `Node(x,y,z)))
    |||| ((a "1") >>>> (fun _ -> `LF("1")))
    |||| ((a "") >>>> (fun _ -> `LF(""))))
  i)
Input size | Number of results (good parse trees)
0          | 1
1          | 1
2          | 3
3          | 19
4          | 150
...
19         | 441152315040444150
See here
Fact: computing results for inputs of size 19 or larger is not feasible
Note: an exponential number of results means this must take (at least) an exponential amount of time
Note: this has nothing to do with parsing; it is due to the actions
Conceptual distinction: parsing is not applying actions
Parsing is different from applying the actions
Parsing can be done in polynomial time (e.g. using an Earley parser); the results can be computed in polynomial time and stored in polynomial space.
Applying actions takes how long? It depends on the actions.
Applying actions over all good parse trees (even where each action takes constant time) can take an exponential amount of time, simply because there are exponentially many good parse trees
Example: counting
(* Grammar: E -> E E E | "1" | eps *)
let rec parse_E = (fun i -> (mkntparser "E" (
    ((parse_E ***> parse_E ***> parse_E) >>>> (fun (x,(y,z)) -> x+y+z))
    |||| ((a "1") >>>> (fun _ -> 1))
    |||| ((a "") >>>> (fun _ -> 0))))
  i)
Input size | Number of results
0          | 1
1          | 1
2          | 1
4          | 1
...
19         | 1
Example: memoized counting (YCDTWAOP)
let rec parse_E =
  let tbl_E = MyHashtbl.create 100 in
  (fun i -> memo_p3 tbl_E (mkntparser "E" (
      ((parse_E ***> parse_E ***> parse_E) >>>> (fun (x,(y,z)) -> x+y+z))
      |||| ((a "1") >>>> (fun _ -> 1))
      |||| ((a "") >>>> (fun _ -> 0))))
    i)
This is polytime - inputs of length 100 (!!!) take a few seconds to process and return a single result
- The library has been designed to ensure "optimal" performance at each stage.
- If you want to write highly ambiguous grammars then parsing is still O(n^3).
- The performance when applying actions can be optimized using memoization
Simple semantics: compute actions over all (good) parse trees; there are exponentially many such parse trees, but this does not have to take exponential time
What this means is that, provided your actions don't cause exponentially many results to be returned, performance is often quite reasonable
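The memoization pattern behind memo_p3 can be sketched independently of the library. The following is a minimal, self-contained illustration; the helper memo is hypothetical and not part of the P3 API:

```ocaml
(* A minimal, self-contained sketch of the memoization pattern used by
   memo_p3; the helper [memo] is hypothetical, not part of the P3 API. *)

(* Wrap a function so that repeated calls with the same argument are
   answered from a hash table instead of being recomputed. *)
let memo (tbl : ('a, 'b) Hashtbl.t) (f : 'a -> 'b) : 'a -> 'b =
  fun x ->
    match Hashtbl.find_opt tbl x with
    | Some v -> v
    | None -> let v = f x in Hashtbl.add tbl x v; v

(* Example: a naively exponential recursion becomes linear once the
   recursive calls go through the table. *)
let tbl = Hashtbl.create 100

let rec fib n =
  memo tbl (fun n -> if n < 2 then n else fib (n - 1) + fib (n - 2)) n
```

As with parse_E above, the table must be created once, outside the function being memoized, so that all recursive calls share it.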
Example: disambiguation
- Arithmetic expressions are pretty basic, but parsing and disambiguating them is somewhat complicated
E -> E "+" E | E "-" E | ...
- One approach: fiddle with the grammar (this uses CFG structure to encode associativity and precedence) link
<Exp> ::= <Exp> + <Term> | <Exp> - <Term> | <Term>
<Term> ::= <Term> * <Factor> | <Term> / <Factor> | <Factor>
<Factor> ::= x | y | ... | ( <Exp> ) | - <Factor>
- Another approach: add associativity and precedence directly to the parser link; the following expresses associativity and precedence (later declarations have higher precedence)
%left PLUS MINUS
%left MULTIPLY DIVIDE
%left NEG /* negation -- unary minus */
%right CARET /* exponentiation */
- Both these approaches can be used by P3 without any special support in the parser e.g. see here
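To see how the first approach works, here is a minimal, self-contained sketch (plain recursive descent, not P3 code) of the stratified grammar above; the token and AST types are invented for illustration, and parentheses and unary minus are omitted:

```ocaml
(* A minimal sketch (plain recursive descent, not P3 code) showing how the
   stratified <Exp>/<Term>/<Factor> grammar encodes precedence and
   associativity. Token and AST types are invented for illustration. *)
type tok = NUM of int | PLUS | TIMES
type exp = Num of int | Plus of exp * exp | Times of exp * exp

(* <Factor> ::= num *)
let parse_factor = function
  | NUM n :: rest -> (Num n, rest)
  | _ -> failwith "expected a number"

(* <Term> ::= <Term> * <Factor> | <Factor>; the left recursion becomes a
   loop, which also makes * left-associative *)
let parse_term toks =
  let (f, rest) = parse_factor toks in
  let rec loop acc = function
    | TIMES :: rest ->
      let (f, rest) = parse_factor rest in loop (Times (acc, f)) rest
    | rest -> (acc, rest)
  in
  loop f rest

(* <Exp> ::= <Exp> + <Term> | <Term> *)
let parse_exp toks =
  let (t, rest) = parse_term toks in
  let rec loop acc = function
    | PLUS :: rest ->
      let (t, rest) = parse_term rest in loop (Plus (acc, t)) rest
    | rest -> (acc, rest)
  in
  loop t rest

(* 1*2+3 parses as (1*2)+3, with no separate precedence declarations *)
let (tree, _) = parse_exp [NUM 1; TIMES; NUM 2; PLUS; NUM 3]
```

The CFG structure alone gives * higher precedence than +, which is exactly what the stratification into <Exp>, <Term> and <Factor> buys you.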
Example: disambiguation (YCDTWAOP)
A general approach: rewrite parse trees (or, less generally, throw away ones you don't want)
The following is for right-associative + and *; we return an option (Some if the parse is acceptable, None if the parse is not acceptable) link
Suppose the input is 1*2+3; this should be parsed as (1*2)+3, not 1*(2+3)
E -> E "+" E {{ fun (x,(_,y)) -> (match x,y with
                 | (Some(Plus _),Some _) -> None (* not (x+y)+z ! *)
                 | (Some x,Some y) -> Some(Plus(x,y))
                 | _ -> None) }}
  | E "*" E {{ fun (x,(_,y)) -> (match x,y with
                 | (Some(Times _),Some _) -> None (* not (x*y)*z ! *)
                 | (Some(Plus _),Some _) -> None (* not (x+y)*z ! *)
                 | (Some x,Some(Plus _)) -> None (* not x*(y+z) ! <-- *)
                 | (Some x,Some y) -> Some(Times(x,y))
                 | _ -> None) }}
  | ?num? {{ fun s -> Some(Num(int_of_string (content s))) }}
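The effect of these actions can be checked with a standalone filter over candidate parse trees; the type exp and the function accept below are invented for illustration, not part of P3:

```ocaml
(* A standalone illustration (not P3 code) of the actions above, expressed
   as a filter over candidate parse trees; [exp] and [accept] are invented
   for illustration. *)
type exp = Num of int | Plus of exp * exp | Times of exp * exp

let rec accept = function
  | Num _ -> true
  | Plus (Plus _, _) -> false   (* not (x+y)+z: + is right-assoc *)
  | Plus (x, y) -> accept x && accept y
  | Times (Times _, _) -> false (* not (x*y)*z: * is right-assoc *)
  | Times (Plus _, _) -> false  (* not (x+y)*z *)
  | Times (_, Plus _) -> false  (* not x*(y+z) *)
  | Times (x, y) -> accept x && accept y

(* For input 1*2+3: the tree (1*2)+3 is kept, 1*(2+3) is rejected *)
let good = accept (Plus (Times (Num 1, Num 2), Num 3))
let bad = accept (Times (Num 1, Plus (Num 2, Num 3)))
```

This is the "rewrite or throw away parse trees" approach in its simplest form: parse everything, then keep only the trees you want.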
Example: modular combination of parsers (YCDTWAOP probably)
Consider grammars that are X (where X is LALR(1) or LR(1) or ...)
In general, combining two X grammars does not result in an X grammar. So you can't modularly specify and combine grammars from such classes without moving to a more general class.
In contrast, two CFGs can be combined to form a CFG. So you can modularly specify and combine such grammars.
Example modular specification and combination of parsers and evaluators for arithmetic, boolean expressions, and lambda calculus here
(* w is (possibly zero length) whitespace *)
(* shadow ***> so the new version also consumes whitespace between the two parsers *)
let ( ***> ) x y = (x ***> w ***> y) >>>> (fun (x,(_,y)) -> (x,y))

let rec parse_A h = (fun i -> mkntparser "arithexp" (
    (((parse_A h) ***> (a "+") ***> (parse_A h))
     >>>> (fun (e1,(_,e2)) -> `Plus(e1,e2)))
    |||| (((parse_A h) ***> (a "-") ***> (parse_A h))
          >>>> (fun (e1,(_,e2)) -> `Minus(e1,e2)))
    |||| (parse_num
          >>>> (fun s -> `Num(int_of_string (content s))))
    |||| (((a "(") ***> (parse_A h) ***> (a ")")) (* brackets *)
          >>>> (fun (_,(e,_)) -> e))
    |||| h) (* helper parser *)
  i)
Note the presence of a helper parser h; this parses "all the other types of expressions"
Define parse_B for booleans, and parse_L for lambda expressions
let rec parse_L h = (fun i -> mkntparser "lambdaexp" (
    (((a "\\") ***> parse_var ***> (parse_L h)) (* lam *)
     >>>> (fun (_,(x,body)) -> `Lam(x,body)))
    |||| (((parse_L h) ***> (parse_L h)) (* app *)
          >>>> (fun (e1,e2) -> `App(e1,e2)))
    |||| (parse_var >>>> (fun s -> `Var s)) (* var *)
    |||| (((a "(") ***> (parse_L h) ***> (a ")")) (* brackets *)
          >>>> (fun (_,(e,_)) -> e))
    |||| h) (* helper parser *)
  i)
let parse_U = (
  let rec l i = parse_L h i
  and a i = parse_A h i
  and b i = parse_B h i
  and h i = (l |||| a |||| b) i
  in
  l)
Arrrrggghhh! What about termination? Relax... the approach is complete, which means that you get all results, which means (inter alia) that the parsers always terminate
Then define an evaluator in the usual way, and finally, combine the parser and the evaluator
let parse_and_eval txt = (remove_err (
  p3_run_parser
    ((parse_memo_U ()) >>>> eval empty_env)
    txt))
let y = "(\\ f ((\\ x (f (x x))) (\\ x (f (x x)))))"
(* sigma; let rec sigma x = if x < 2 then 1 else x+(sigma (x-1)) *)
let sigma = "(\\ g (\\ x (if (x < 2) then 1 else (x+(g (x-1))))))"
(* following is a lambda calc version of the sigma function, applied to argument 5 *)
let txt = "("^y^" "^sigma^") 5"
let [r] = parse_and_eval txt (* expect exactly one result *)
Conclusion
It works and is usable for prototyping and small applications
The back-end Earley parser is highly engineered, and probably of use in other applications
I'm still unhappy with the combinator interface (eta expansion, naming of parsers)
I'm still working to improve real-world performance further
Hopefully the talk has shown you a few things you might not have seen before in any other parser library
Long term aim: general parser, verified correctness and performance, usable in the real-world
https://github.com/tomjridge/p3