Commit 8fad94a2 authored by POTTIER Francois's avatar POTTIER Francois

Progress on the blog post.

parent a5c7afee
......@@ -57,6 +57,10 @@ when deciding whether two regular expressions are equal.
In the following,
I refer to these equations collectively as **EQTH**,
for *equational theory*.
Among other things, these equations state that disjunction
is associative, commutative, and idempotent.
In other words, a disjunction must be viewed as a set of disjuncts.
The empty regular expression can be viewed as an empty disjunction.
[The complete code](http://gitlab.inria.fr/fpottier/fix/demos/brz/Brzozowski.ml)
for this demo is also available.
......@@ -84,32 +88,73 @@ function `exists_char` of type `(Char.t -> bool) -> bool`.
## Regular expressions, hash-consed
In OCaml, the syntax of regular expressions is naturally described by an
algebraic data type `regexp`.
The syntax of regular expressions (expressions, for short) is naturally
described by an algebraic data type `regexp`.
It is the same syntax as in
[Owens et al.'s paper](https://www.cs.kent.ac.uk/people/staff/sao/documents/jfp09.pdf),
except I use n-ary disjunctions and conjunctions. As explained above, this is
dictated by the need to take EQTH into account. The data constructors `EDisj`
and `EConj` carry a list of subexpressions. This list is normalized in such a
way that, if two lists are equal as sets, then they are equal as lists, too.
This list is never a singleton list. It can be empty: `EDisj []` is the empty
expression `zero`, while `EConj []` is the universal expression `one`.
```
type regexp =
skeleton HashCons.cell
and skeleton =
| EEpsilon
| EChar of Char.t
| ECat of regexp * regexp
| EStar of regexp
| EDisj of regexp list
| EConj of regexp list
| ENeg of regexp
```
A slight twist is that expressions are
[**hash-consed**](https://en.wikipedia.org/wiki/Hash_consing). That is, every
expression is decorated with an integer identifier, and these identifiers are
unique: two expressions are equal if and only if they carry the same
identifier. (This notion of equality takes **EQTH** into account.) This allows
efficiently testing whether two expressions are equal. This also allows
building efficient dictionaries whose keys are expressions, or, in other
words, efficient memoized functions of type `regexp -> ...`. This is heavily
exploited in the code that follows: the functions `nullable`, `delta`, and
`nonempty` are three examples, and there are more.
expression is decorated with an integer identifier. (A record of type
`skeleton HashCons.cell` is a pair of an integer identifier and a skeleton.)
These identifiers are unique: two expressions are equal if and only if they
carry the same identifier. (This notion of equality takes EQTH into account.)
This allows testing very quickly whether two expressions are equal. This also
allows building efficient dictionaries whose keys are expressions, or in
other words, efficient memoized functions of type `regexp -> ...`. This is
heavily exploited in the code that follows: the functions `nullable`, `delta`,
and `nonempty` are three examples, and there are more.
<!-- `N.encode`, inside `dfa`, is another example. -->
The syntax of regular expressions is the same as in
[Owens et al.'s paper](https://www.cs.kent.ac.uk/people/staff/sao/documents/jfp09.pdf),
except I use n-ary disjunctions and conjunctions. This is dictated by the need
to normalize terms with respect to the equations in Definition 4.1. The
equations state that disjunction and conjunction are associative, commutative,
idempotent, and have a unit (ACIU). In other words, a disjunction must be
viewed as a set of disjuncts, and a conjunction must be viewed as a set of
conjuncts. For this reason, the data constructors `EDisj` and `EConj` carry a
list of subexpressions. This list is normalized in such a way that, if two
lists are equal as sets, then they are equal as lists, too.
The list carried by `EDisj` and `EConj` is never a singleton list. It can be
an empty list: `EDisj []` is the empty regular expression `zero`, while `EConj
[]` is the universal regular expression `one`.
The module
[HashCons](https://gitlab.inria.fr/fpottier/fix/blob/master/src/HashCons.mli),
which is part of
[Fix](https://gitlab.inria.fr/fpottier/fix/),
provides facilities for hash-consing.
(The reader is encouraged to take a look at
[its implementation:](https://gitlab.inria.fr/fpottier/fix/blob/master/src/HashCons.ml)
it is very short.)
It is used as follows:
```
let make : skeleton -> regexp =
let module H = HashCons.ForHashedType(struct
(* Define equality and hashing of skeletons. *)
type t = skeleton
let equal = ...
let hash = ...
end)
in H.make
let skeleton : regexp -> skeleton =
HashCons.data
```
The function `make` is where the magic takes place: whenever we wish to
construct a new expression, we construct just a skeleton and pass it to
`make`, which takes care of determining whether this skeleton is already
known (in which case an existing integer identity is re-used) or is new
(in which case a fresh integer identity is allocated).
The function `skeleton`, which converts between an expression and a
skeleton in the reverse direction, is just a pair projection.
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment