Commit 195c19d7 authored by POTTIER Francois

Beginning of a blog post.

parent d2ea8627
# Been there, done that: from REs to DFAs
<!-- TEMPORARY title -->
There are several ways of compiling
a [regular expression](https://en.wikipedia.org/wiki/Regular_expression) (RE)
down to a
[deterministic finite-state automaton](https://en.wikipedia.org/wiki/Deterministic_finite_automaton) (DFA).
One such way is based on
[Brzozowski derivatives](https://en.wikipedia.org/wiki/Brzozowski_derivative)
of regular expressions.
In this post,
I describe a concise OCaml implementation of this transformation.
This is an opportunity to illustrate the use of
[Fix](https://gitlab.inria.fr/fpottier/fix/),
a library that offers facilities for
constructing (recursive) memoized functions
and for performing least fixed point computations.

The transformation of REs to DFAs is based on the description
given by Scott Owens, John Reppy and Aaron Turon in the paper
[Regular-expression derivatives re-examined](https://www.cs.kent.ac.uk/people/staff/sao/documents/jfp09.pdf).
## Preliminaries
In order to read the following, a tiny bit of vocabulary is required.
I write `epsilon` for the regular expression that
accepts the empty word, and only the empty word.
I write `zero` for the regular expression that
accepts nothing.
A regular expression `e` is **nullable** if and only if it accepts the empty word.
It is easy to determine, by inspection of the syntax of `e`, whether this is the
case (the code appears further on).
Let `iota e` stand for `epsilon` if `e` is nullable
and for `zero` otherwise. `iota e` is a regular expression
that accepts the empty word if and only if `e` accepts the empty word,
and accepts nothing else.
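
To make these two definitions concrete, here is a minimal sketch. It assumes a hypothetical `regexp` type whose constructor names (`Zero`, `Epsilon`, `Char`, `Cat`, `Alt`, `Star`) are chosen for illustration only; it is not necessarily the representation used by the implementation that this post describes.

```ocaml
(* Hypothetical syntax of regular expressions over characters. *)
type regexp =
  | Zero                    (* accepts nothing *)
  | Epsilon                 (* accepts the empty word, and only the empty word *)
  | Char of char            (* accepts the one-letter word made of this symbol *)
  | Cat of regexp * regexp  (* sequencing *)
  | Alt of regexp * regexp  (* choice *)
  | Star of regexp          (* iteration *)

(* [nullable e] tells whether [e] accepts the empty word.
   It is computed by inspection of the syntax of [e]. *)
let rec nullable (e : regexp) : bool =
  match e with
  | Zero | Char _ -> false
  | Epsilon | Star _ -> true
  | Cat (e1, e2) -> nullable e1 && nullable e2
  | Alt (e1, e2) -> nullable e1 || nullable e2

(* [iota e] is [Epsilon] if [e] is nullable and [Zero] otherwise. *)
let iota (e : regexp) : regexp =
  if nullable e then Epsilon else Zero
```

With these definitions, `iota (Star (Char 'a'))` evaluates to `Epsilon`, whereas `iota (Char 'a')` evaluates to `Zero`.
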
Let `delta a e` stand for the **derivative** of the regular expression `e`
with respect to the symbol `a`. Thinking of `e` as a set of words, `delta a e`
is obtained by keeping only the words that begin with `a` and by crossing out
in each such word this initial letter `a`. For instance, the derivative of the
set `{ ace, amid, bar }` with respect to `a` is the set `{ ce, mid }`. The
derivative of a regular expression can also be easily computed by inspection
of its syntax (read on).
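
As an illustration, the derivative could be computed along the following lines. This is a sketch under assumptions: the `regexp` type and the helper functions `word` and `accepts` are hypothetical names introduced here, and no simplification of the resulting expressions is attempted.

```ocaml
type regexp =
  | Zero | Epsilon | Char of char
  | Cat of regexp * regexp | Alt of regexp * regexp | Star of regexp

let rec nullable (e : regexp) : bool =
  match e with
  | Zero | Char _ -> false
  | Epsilon | Star _ -> true
  | Cat (e1, e2) -> nullable e1 && nullable e2
  | Alt (e1, e2) -> nullable e1 || nullable e2

(* [delta a e] is the derivative of [e] with respect to the symbol [a]. *)
let rec delta (a : char) (e : regexp) : regexp =
  match e with
  | Zero | Epsilon -> Zero
  | Char c -> if c = a then Epsilon else Zero
  | Alt (e1, e2) -> Alt (delta a e1, delta a e2)
  | Cat (e1, e2) ->
      (* Either [a] is consumed by [e1], or [e1] accepts the empty word
         and [a] is consumed by [e2]. *)
      let d = Cat (delta a e1, e2) in
      if nullable e1 then Alt (d, delta a e2) else d
  | Star e1 -> Cat (delta a e1, e)

(* [word s] is a regular expression that accepts the word [s] only. *)
let word (s : string) : regexp =
  List.fold_right (fun c e -> Cat (Char c, e))
    (List.of_seq (String.to_seq s)) Epsilon

(* [accepts e w] takes the derivative of [e] with respect to each letter
   of [w] in turn, then checks whether the result is nullable. *)
let accepts (e : regexp) (w : string) : bool =
  nullable (Seq.fold_left (fun e a -> delta a e) e (String.to_seq w))

(* The set { ace, amid, bar } and its derivative with respect to [a]. *)
let lang = Alt (word "ace", Alt (word "amid", word "bar"))
let d = delta 'a' lang
```

Here, `d` accepts `ce` and `mid`, and rejects `ar` as well as `ace`, which matches the set `{ ce, mid }` given above.
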
## From an RE to a DFA
The main idea behind the construction is this. First, **a regular expression
`e` can be decomposed into an infinite tree**, or an infinite-state automaton,
whose vertices correspond to the iterated derivatives of `e`. Second, because
a regular expression only has a finite number of iterated derivatives (up to a
certain equational theory), **this infinite tree must in fact be the unfolding
of a finite cyclic graph**, a finite-state automaton. Because it is possible
to effectively recognize when two regular expressions are equal up to this
theory, it is possible to effectively construct this finite, cyclic data structure.

In slightly greater detail, suppose, for simplicity, that we have a two-letter
alphabet, whose symbols are `a` and `b`. A word on this alphabet, then, either
is the empty word, or begins with `a`, or begins with `b`. For this reason,
an arbitrary regular expression `e` can be decomposed as follows. `e` is equal
to:
```
iota e
+ a . delta a e
+ b . delta b e
```
Here, `+` stands for choice, while `.` stands for sequencing. This can be read
as the beginning of a tree-structured automaton. The root state corresponds to
the regular expression `e`. It is an accepting state if and only if `iota e`
is nonempty, that is, if and only if `e` is nullable. Out of this state, there
are two transitions. A transition labeled `a` leads to a subtree that
corresponds to the regular expression `delta a e`. Similarly, a transition
labeled `b` leads to a subtree that corresponds to `delta b e`.
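
The decomposition can be checked experimentally. The sketch below builds the right-hand side of the equation and compares it with `e` on all short words over the alphabet `{a, b}`. The type `regexp` and the functions `decompose`, `words` and `agree` are hypothetical names introduced for this illustration, and words are represented as lists of characters.

```ocaml
type regexp =
  | Zero | Epsilon | Char of char
  | Cat of regexp * regexp | Alt of regexp * regexp | Star of regexp

let rec nullable (e : regexp) : bool =
  match e with
  | Zero | Char _ -> false
  | Epsilon | Star _ -> true
  | Cat (e1, e2) -> nullable e1 && nullable e2
  | Alt (e1, e2) -> nullable e1 || nullable e2

let iota (e : regexp) : regexp =
  if nullable e then Epsilon else Zero

let rec delta (a : char) (e : regexp) : regexp =
  match e with
  | Zero | Epsilon -> Zero
  | Char c -> if c = a then Epsilon else Zero
  | Alt (e1, e2) -> Alt (delta a e1, delta a e2)
  | Cat (e1, e2) ->
      let d = Cat (delta a e1, e2) in
      if nullable e1 then Alt (d, delta a e2) else d
  | Star e1 -> Cat (delta a e1, e)

(* [accepts e w] follows one transition per letter of [w], starting at
   the root state [e], and checks whether the state reached is accepting. *)
let accepts (e : regexp) (w : char list) : bool =
  nullable (List.fold_left (fun e a -> delta a e) e w)

(* [decompose e] is [iota e + a . delta a e + b . delta b e]. *)
let decompose (e : regexp) : regexp =
  Alt (iota e, Alt (Cat (Char 'a', delta 'a' e), Cat (Char 'b', delta 'b' e)))

(* [words n] lists all words over {a, b} whose length is at most [n]. *)
let rec words (n : int) : char list list =
  if n = 0 then [ [] ]
  else [] :: List.concat_map (fun w -> [ 'a' :: w; 'b' :: w ]) (words (n - 1))

(* [agree n e] checks that [e] and [decompose e] accept the same words
   of length at most [n]. *)
let agree (n : int) (e : regexp) : bool =
  List.for_all (fun w -> accepts e w = accepts (decompose e) w) (words n)

(* Example: the words over {a, b} that end with [a]. *)
let e = Cat (Star (Alt (Char 'a', Char 'b')), Char 'a')
```

For this example, `agree 4 e` holds. Such a bounded test is only a sanity check, of course, and does not replace the argument given above.
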
This process can be iterated: the regular expressions `delta a e` and `delta b
e` can be decomposed, too. Thus, the regular expression `e` is also equal to:
```
  iota e
+ a . (  iota (delta a e)
       + a . (delta a (delta a e))
       + b . (delta b (delta a e))
      )
+ b . (  iota (delta b e)
       + a . (delta a (delta b e))
       + b . (delta b (delta b e))
      )
```
and so on, down to an arbitrary depth. This gives rise to an infinite tree,
whose vertices correspond to the iterated derivatives of `e`.
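
One way to picture this infinite tree in OCaml is as a lazy data structure, whose subtrees are computed only when they are visited. The sketch below is an illustration under assumptions: the types `regexp` and `tree` and the functions `unfold` and `run` are hypothetical names, and the alphabet is fixed to `{a, b}`.

```ocaml
type regexp =
  | Zero | Epsilon | Char of char
  | Cat of regexp * regexp | Alt of regexp * regexp | Star of regexp

let rec nullable (e : regexp) : bool =
  match e with
  | Zero | Char _ -> false
  | Epsilon | Star _ -> true
  | Cat (e1, e2) -> nullable e1 && nullable e2
  | Alt (e1, e2) -> nullable e1 || nullable e2

let rec delta (a : char) (e : regexp) : regexp =
  match e with
  | Zero | Epsilon -> Zero
  | Char c -> if c = a then Epsilon else Zero
  | Alt (e1, e2) -> Alt (delta a e1, delta a e2)
  | Cat (e1, e2) ->
      let d = Cat (delta a e1, e2) in
      if nullable e1 then Alt (d, delta a e2) else d
  | Star e1 -> Cat (delta a e1, e)

(* A vertex carries a flag that tells whether it is accepting, a subtree
   reached via [a], and a subtree reached via [b]. Subtrees are suspended,
   so the infinite tree is never built in full. *)
type tree = Node of bool * tree Lazy.t * tree Lazy.t

(* [unfold e] is the infinite tree whose root corresponds to [e]. *)
let rec unfold (e : regexp) : tree =
  Node (nullable e, lazy (unfold (delta 'a' e)), lazy (unfold (delta 'b' e)))

(* [run t w] follows the path spelled by [w] down the tree [t] and tells
   whether the vertex reached is accepting. *)
let rec run (Node (accepting, ta, tb)) (w : char list) : bool =
  match w with
  | [] -> accepting
  | 'a' :: w -> run (Lazy.force ta) w
  | 'b' :: w -> run (Lazy.force tb) w
  | _ -> false (* a symbol outside the alphabet *)

(* Example: the words formed of any number of [a]'s followed by one [b]. *)
let t = unfold (Cat (Star (Char 'a'), Char 'b'))
```

In this example, `run t ['a'; 'a'; 'b']` is `true`, while `run t ['b'; 'a']` and `run t []` are `false`. Nothing in this sketch detects that two vertices correspond to the same regular expression, so no sharing takes place.
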
Brzozowski's key remark is that this tree has only finitely many distinct subtrees.
Provided "equality" of regular expressions
includes the following laws (where `0` stands for `zero`),
```
0 + e = e
e + 0 = e
e + e = e
e . 0 = 0
(more laws, not shown)
```
a regular expression only has a finite
number of iterated derivatives.
(When regular expressions are viewed as semantic objects,
sets of words, this is the Myhill–Nerode theorem. Brzozowski's
insight is that, when regular expressions are viewed as syntactic objects,
an analogous result holds.)
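
To illustrate how such laws can be built into the construction, the following sketch uses "smart constructors" that normalize expressions as they are built, then explores the iterated derivatives until no new expression appears. All names (`cat`, `alt`, `summands`, `states`) are hypothetical. The sketch also treats choice as associative and commutative, by flattening and sorting the summands; this is an assumption about what the laws not shown above include. It has been checked on small examples only.

```ocaml
type regexp =
  | Zero | Epsilon | Char of char
  | Cat of regexp * regexp | Alt of regexp * regexp | Star of regexp

(* Sequencing, with [0 . e = e . 0 = 0] and [epsilon] as a unit. *)
let cat (e1 : regexp) (e2 : regexp) : regexp =
  match e1, e2 with
  | Zero, _ | _, Zero -> Zero
  | Epsilon, e | e, Epsilon -> e
  | _ -> Cat (e1, e2)

(* [summands e acc] prepends to [acc] the summands of the choice [e],
   dropping every [Zero]. *)
let rec summands (e : regexp) (acc : regexp list) : regexp list =
  match e with
  | Zero -> acc
  | Alt (e1, e2) -> summands e1 (summands e2 acc)
  | _ -> e :: acc

(* Choice, with [0 + e = e + 0 = e] and [e + e = e]. The summands are
   sorted and deduplicated, so their order and nesting do not matter. *)
let alt (e1 : regexp) (e2 : regexp) : regexp =
  match List.sort_uniq compare (summands e1 (summands e2 [])) with
  | [] -> Zero
  | e :: es -> List.fold_left (fun acc e -> Alt (acc, e)) e es

let rec nullable (e : regexp) : bool =
  match e with
  | Zero | Char _ -> false
  | Epsilon | Star _ -> true
  | Cat (e1, e2) -> nullable e1 && nullable e2
  | Alt (e1, e2) -> nullable e1 || nullable e2

(* The derivative, built with the smart constructors. *)
let rec delta (a : char) (e : regexp) : regexp =
  match e with
  | Zero | Epsilon -> Zero
  | Char c -> if c = a then Epsilon else Zero
  | Alt (e1, e2) -> alt (delta a e1) (delta a e2)
  | Cat (e1, e2) ->
      let d = cat (delta a e1) e2 in
      if nullable e1 then alt d (delta a e2) else d
  | Star e1 -> cat (delta a e1) e

(* [states alphabet e] lists the distinct iterated derivatives of [e],
   found by a depth-first search that compares expressions structurally. *)
let states (alphabet : char list) (e : regexp) : regexp list =
  let rec visit seen e =
    if List.mem e seen then seen
    else
      List.fold_left (fun seen a -> visit seen (delta a e)) (e :: seen) alphabet
  in
  List.rev (visit [] e)

(* Example: the words over {a, b} that end with [a]. *)
let e = Cat (Star (Alt (Char 'a', Char 'b')), Char 'a')
let qs = states [ 'a'; 'b' ] e
```

For this example, the search stops after finding two expressions, namely `e` itself and `Alt (Epsilon, e)`. They play the role of the two states of a DFA, of which only the second is accepting. A list with linear-time membership tests is used here for brevity; a realistic implementation would rely on a more efficient data structure.
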