Commit bcac1c43 by POTTIER Francois

### Some progress on the blog post.

parent 08ee3403
 ... ... @@ -51,3 +51,65 @@ automaton. For more details, please consult the paper [Regular-expression derivatives re-examined](https://www.cs.kent.ac.uk/people/staff/sao/documents/jfp09.pdf) by Scott Owens, John Reppy and Aaron Turon. In particular, Definition 4.1 in that paper gives a number of equations that must be exploited when deciding whether two regular expressions are equal. In the following, I refer to these equations collectively as **EQTH**, for *equational theory*. [The complete code](http://gitlab.inria.fr/fpottier/fix/demos/brz/Brzozowski.ml) for this demo is also available. ## An alphabet Throughout, I assume that the alphabet is given by a module `Char` whose signature is as follows: ``` Char : sig type t val equal: t -> t -> bool val hash: t -> int val foreach: (t -> unit) -> unit val print: t -> string end ``` The fact that this alphabet is finite is witnessed by the existence of the function `Char.foreach`, which enumerates all characters. As an exercise for the reader, this can be used to define an auxiliary function `exists_char` of type `(Char.t -> bool) -> bool`. ## Regular expressions, hash-consed In OCaml, the syntax of regular expressions is naturally described by an algebraic data type `regexp`. A slight twist is that expressions are [**hash-consed**](https://en.wikipedia.org/wiki/Hash_consing). That is, every expression is decorated with an integer identifier, and these identifiers are unique: two expressions are equal if and only if they carry the same identifier. (This notion of equality takes **EQTH** into account.) This allows efficiently testing whether two expressions are equal. This also allows building efficient dictionaries whose keys are expressions, or, in other words, efficient memoized functions of type `regexp -> ...`. This is heavily exploited in the code that follows: the functions `nullable`, `delta`, and `nonempty` are three examples, and there are more. The syntax of regular expressions is the same as in [Owens et al.'s paper](https://www.cs.kent.ac.uk/people/staff/sao/documents/jfp09.pdf), except I use n-ary disjunctions and conjunctions. This is dictated by the need to normalize terms with respect to the equations in Definition 4.1. The equations state that disjunction and conjunction are associative, commutative, idempotent, and have a unit (ACIU). In other words, a disjunction must be viewed as a set of disjuncts, and a conjunction must be viewed as a set of conjuncts. For this reason, the data constructors `EDisj` and `EConj` carry a list of subexpressions. This list is normalized in such a way that, if two lists are equal as sets, then they are equal as lists, too. The list carried by `EDisj` and `EConj` is never a singleton list. It can be an empty list: `EDisj []` is the empty regular expression `zero`, while `EConj []` is the universal regular expression `one`.
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!