Commit 195c19d7, authored Nov 27, 2018 by POTTIER Francois

Beginning of a blog post.

Parent: d2ea8627

Showing 1 changed file with 115 additions and 0 deletions.

demos/brz/post.md (new file, 0 → 100644)

# Been there, done that: from REs to DFAs
<!-- TEMPORARY title -->
There are several ways of compiling a [regular expression](https://en.wikipedia.org/wiki/Regular_expression) (RE) down to a [deterministic finite-state automaton](https://en.wikipedia.org/wiki/Deterministic_finite_automaton) (DFA). One such way is based on [Brzozowski derivatives](https://en.wikipedia.org/wiki/Brzozowski_derivative) of regular expressions. In this post, I describe a concise OCaml implementation of this transformation. This is an opportunity to illustrate the use of [Fix](https://gitlab.inria.fr/fpottier/fix/), a library that offers facilities for constructing (recursive) memoized functions and for performing least fixed point computations.

The transformation of REs to DFAs is based on the description given by Scott Owens, John Reppy and Aaron Turon in the paper [Regular-expression derivatives re-examined](https://www.cs.kent.ac.uk/people/staff/sao/documents/jfp09.pdf).

## Preliminaries
In order to read the following, a tiny bit of vocabulary is required.

I write `epsilon` for the regular expression that accepts the empty word, and only the empty word. I write `zero` for the regular expression that accepts nothing.

A regular expression `e` is **nullable** if and only if it accepts the empty word. It is easy to determine, by inspection of the syntax of `e`, whether this is the case (the code appears further on).

Let `iota e` stand for `epsilon` if `e` is nullable and for `zero` otherwise. `iota e` is a regular expression that accepts the empty word if and only if `e` accepts the empty word, and accepts nothing else.

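To make this vocabulary concrete, here is a minimal sketch, assuming a small abstract syntax of regular expressions over characters. The type `regexp` and its constructor names are hypothetical, chosen for illustration only; they are not taken from Fix or from the paper.

```ocaml
(* A hypothetical abstract syntax for regular expressions (an assumption
   made for this sketch). *)
type regexp =
  | Zero                      (* accepts nothing *)
  | Epsilon                   (* accepts the empty word, and only the empty word *)
  | Char of char              (* accepts the one-letter word made of this symbol *)
  | Cat of regexp * regexp    (* sequencing *)
  | Alt of regexp * regexp    (* choice *)
  | Star of regexp            (* iteration *)

(* [nullable e] holds if and only if [e] accepts the empty word.
   It is computed by inspection of the syntax of [e]. *)
let rec nullable (e : regexp) : bool =
  match e with
  | Zero | Char _ -> false
  | Epsilon | Star _ -> true
  | Cat (e1, e2) -> nullable e1 && nullable e2
  | Alt (e1, e2) -> nullable e1 || nullable e2

(* [iota e] is [Epsilon] if [e] is nullable and [Zero] otherwise. *)
let iota (e : regexp) : regexp =
  if nullable e then Epsilon else Zero
```

For instance, under these assumptions, `iota (Star (Char 'a'))` is `Epsilon`, whereas `iota (Char 'a')` is `Zero`.
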
Let `delta a e` stand for the **derivative** of the regular expression `e` with respect to the symbol `a`. Thinking of `e` as a set of words, `delta a e` is obtained by keeping only the words that begin with `a` and by crossing out in each such word this initial letter `a`. For instance, the derivative of the set `{ ace, amid, bar }` with respect to `a` is the set `{ ce, mid }`. The derivative of a regular expression can also be easily computed by inspection of its syntax (read on).

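The set-of-words reading of the derivative can be illustrated directly. The sketch below assumes that a finite set of words is represented as a list of strings; this function works on sets of words only, not on the syntax of regular expressions.

```ocaml
(* [delta a words] keeps only the words that begin with [a] and removes
   this initial letter from each of them. A finite set of words is
   represented as a list of strings (an assumption of this sketch). *)
let delta (a : char) (words : string list) : string list =
  List.filter_map (fun w ->
    if String.length w > 0 && w.[0] = a
    then Some (String.sub w 1 (String.length w - 1))
    else None
  ) words

(* The example from the text: the derivative of { ace, amid, bar }
   with respect to [a]. *)
let example : string list =
  delta 'a' ["ace"; "amid"; "bar"]
(* [example] is ["ce"; "mid"] *)
```
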
## From an RE to a DFA
The main idea behind the construction is this. First, **a regular expression `e` can be decomposed into an infinite tree**, or an infinite-state automaton, whose vertices correspond to the iterated derivatives of `e`. Second, because a regular expression only has a finite number of iterated derivatives (up to a certain equational theory), **this infinite tree must in fact be the unfolding of a finite cyclic graph**, a finite-state automaton. Because it is possible to effectively recognize when two regular expressions are equal modulo this theory, it is possible to effectively construct this finite, cyclic data structure.

In slightly greater detail, suppose, for simplicity, that we have a two-letter alphabet, whose symbols are `a` and `b`. A word over this alphabet, then, either is the empty word, or begins with `a`, or begins with `b`. For this reason, an arbitrary regular expression `e` can be decomposed as follows. `e` is equal to:

```
iota e
+ a . delta a e
+ b . delta b e
```
Here, `+` stands for choice, while `.` stands for sequencing. This can be read as the beginning of a tree-structured automaton. The root state corresponds to the regular expression `e`. It is an accepting state if and only if `iota e` is nonempty, that is, if and only if `e` is nullable. Out of this state, there are two transitions. A transition labeled `a` leads to a subtree that corresponds to the regular expression `delta a e`. Similarly, a transition labeled `b` leads to a subtree that corresponds to `delta b e`.

This process can be iterated: the regular expressions `delta a e` and `delta b e` can be decomposed, too. Thus, the regular expression `e` is also equal to:

```
iota e
+ a . (iota (delta a e)
       + a . (delta a (delta a e))
       + b . (delta b (delta a e))
      )
+ b . (iota (delta b e)
       + a . (delta a (delta b e))
       + b . (delta b (delta b e))
      )
```
and so on, down to an arbitrary depth. This gives rise to an infinite tree, whose vertices correspond to the iterated derivatives of `e`.

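Reading a word amounts to following one path down this tree: each letter selects a transition, and the word is accepted if the vertex reached is an accepting state. The following sketch illustrates this, assuming a hypothetical syntax `regexp` and the textbook syntactic definition of the derivative. It performs no simplification and no sharing of vertices, so it walks the infinite tree rather than a finite automaton.

```ocaml
(* Hypothetical syntax, chosen for this sketch. *)
type regexp =
  | Zero
  | Epsilon
  | Char of char
  | Cat of regexp * regexp
  | Alt of regexp * regexp
  | Star of regexp

(* [nullable e] holds if and only if [e] accepts the empty word. *)
let rec nullable (e : regexp) : bool =
  match e with
  | Zero | Char _ -> false
  | Epsilon | Star _ -> true
  | Cat (e1, e2) -> nullable e1 && nullable e2
  | Alt (e1, e2) -> nullable e1 || nullable e2

(* [delta a e] is the derivative of [e] with respect to the symbol [a],
   computed by inspection of the syntax of [e]. *)
let rec delta (a : char) (e : regexp) : regexp =
  match e with
  | Zero | Epsilon -> Zero
  | Char b -> if a = b then Epsilon else Zero
  | Alt (e1, e2) -> Alt (delta a e1, delta a e2)
  | Cat (e1, e2) ->
      let d = Cat (delta a e1, e2) in
      if nullable e1 then Alt (d, delta a e2) else d
  | Star e1 -> Cat (delta a e1, e)

(* [matches e w] follows one transition per letter of [w], starting at
   the root [e], then tests whether the vertex reached is accepting. *)
let matches (e : regexp) (w : string) : bool =
  nullable (Seq.fold_left (fun e a -> delta a e) e (String.to_seq w))
```

For instance, if `e` is `Cat (Star (Char 'a'), Char 'b')`, then `matches e "aab"` holds, whereas `matches e "aba"` does not.
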
Brzozowski's key remark is that this tree has only finitely many distinct subtrees. Provided "equality" of regular expressions includes the following laws,
```
0 + e = e
e + 0 = e
e + e = e
e . 0 = 0
(more laws, not shown)
```
a regular expression only has a finite number of iterated derivatives. (When regular expressions are viewed as semantic objects, sets of words, this is the Myhill–Nerode theorem. Brzozowski's insight is that, when regular expressions are viewed as syntactic objects, an analogous result holds.)
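
As an illustration, the sketch below enumerates the iterated derivatives of a regular expression. It assumes a hypothetical syntax `regexp` and smart constructors `alt` and `cat` that apply some laws on the fly. The laws beyond the four listed above (associativity and commutativity of `+`, and the laws involving `epsilon`) are guesses at what the "more laws" might include. The sketch does not use Fix: it relies on OCaml's structural comparison, and it assumes that these laws identify enough expressions for the traversal to terminate.

```ocaml
(* Hypothetical syntax, chosen for this sketch. *)
type regexp =
  | Zero
  | Epsilon
  | Char of char
  | Cat of regexp * regexp
  | Alt of regexp * regexp
  | Star of regexp

let rec nullable (e : regexp) : bool =
  match e with
  | Zero | Char _ -> false
  | Epsilon | Star _ -> true
  | Cat (e1, e2) -> nullable e1 && nullable e2
  | Alt (e1, e2) -> nullable e1 || nullable e2

(* [summands e] flattens nested choices and drops [Zero]. *)
let rec summands (e : regexp) : regexp list =
  match e with
  | Alt (e1, e2) -> summands e1 @ summands e2
  | Zero -> []
  | _ -> [e]

(* Smart choice: 0 + e = e, e + 0 = e, e + e = e, and (as an assumption
   of this sketch) associativity and commutativity, obtained by sorting. *)
let alt (e1 : regexp) (e2 : regexp) : regexp =
  match List.sort_uniq compare (summands e1 @ summands e2) with
  | [] -> Zero
  | e :: es -> List.fold_left (fun acc e' -> Alt (acc, e')) e es

(* Smart sequencing: e . 0 = 0, and (assumed here) 0 . e = 0 and
   epsilon . e = e . epsilon = e. *)
let cat (e1 : regexp) (e2 : regexp) : regexp =
  match e1, e2 with
  | Zero, _ | _, Zero -> Zero
  | Epsilon, e | e, Epsilon -> e
  | _ -> Cat (e1, e2)

(* The derivative, built with the smart constructors. *)
let rec delta (a : char) (e : regexp) : regexp =
  match e with
  | Zero | Epsilon -> Zero
  | Char b -> if a = b then Epsilon else Zero
  | Alt (e1, e2) -> alt (delta a e1) (delta a e2)
  | Cat (e1, e2) ->
      let d = cat (delta a e1) e2 in
      if nullable e1 then alt d (delta a e2) else d
  | Star e1 -> cat (delta a e1) e

module S = Set.Make (struct type t = regexp let compare = compare end)

(* [states alphabet e] is the set of the iterated derivatives of [e],
   that is, the states of the automaton, found by a worklist traversal. *)
let states (alphabet : char list) (e : regexp) : S.t =
  let rec visit seen todo =
    match todo with
    | [] -> seen
    | e :: todo ->
        if S.mem e seen then visit seen todo
        else
          let successors = List.map (fun a -> delta a e) alphabet in
          visit (S.add e seen) (successors @ todo)
  in
  visit S.empty [e]
```

For instance, for the expression `Cat (Star (Alt (Char 'a', Char 'b')), Char 'a')` over the alphabet `['a'; 'b']`, this traversal finds two states, one of which is nullable, that is, accepting.
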