from-visitors-to-iterators.md 14.9 KB
Newer Older
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55
I have been asked whether
an automatically-generated visitor,
as produced by the [visitors](https://gitlab.inria.fr/fpottier/visitors)
syntax extension for OCaml,
can be used to construct an iterator.

It turns out that this can be done
in a simple and efficient manner.
(Up to a constant factor, the time complexity of this solution is optimal.)
As the problem is interesting and
its solution is somewhat nonobvious,
I am describing them here.

To play with this code in an OCaml toplevel,
first install `visitors` via the command `opam install visitors`.
Then, launch `ocaml` and type this:

```
#use "topfind";;
#require "visitors.ppx";;
#require "visitors.runtime";;
```

## Problem Statement

Suppose we have an arbitrary data structure that contains elements
of type `'a`. Here, it is a binary tree, but it could be anything:

```
type 'a sometree =
  | Leaf
  | Node of 'a sometree * 'a * 'a sometree
```

We would like to enumerate the elements of this data structure.
More precisely, we would like to construct an iterator, that is,
an on-demand producer of elements. Here is a simple definition
of a stateful iterator:

```
type 'a iterator =
  unit -> 'a option
```

The question is, can we construct an iterator for the type `'a sometree`,
based on an automatically-generated visitor, so that the construction is
entirely independent of the type `'a sometree`?

## Cascades

For starters, let us define cascades, which are a more pleasant kind of
iterators. A cascade is a persistent (stateless) iterator. It can be thought
of as a delayed list, that is, a list whose elements are computed only on
demand.

56 57
Cascades could (should) be part of a separate library.
As the time of writing (March, 2017), there is in fact
58 59 60 61 62 63 64 65 66 67 68 69 70 71
[a proposal](https://github.com/ocaml/ocaml/pull/1002)
to add them to OCaml's standard library.

```
type 'a cascade =
  unit -> 'a head

and 'a head =
  | Nil
  | Cons of 'a * 'a cascade
```

A delayed computation is represented as a function of type `unit -> _`.

POTTIER Francois's avatar
POTTIER Francois committed
72
The cascade constructors are as follows:
73 74 75 76

```
let nil : 'a cascade =
  fun () -> Nil
POTTIER Francois's avatar
POTTIER Francois committed
77 78 79

let cons (x : 'a) (xs : 'a cascade) : 'a cascade =
  fun () -> Cons (x, xs)
80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135
```

Forcing a cascade reveals its head:

```
let force xs =
  xs()
```

A cascade can be easily converted to a stateful iterator:

```
let cascade_to_iterator (xs : 'a cascade) : 'a iterator =
  let s = ref xs in
  fun () ->
    match force !s with
    | Nil ->
        s := nil;
        None
    | Cons (x, xs) ->
        s := xs;
        Some x
```

In the above code, writing `nil` into `s` may seem superfluous, but is in fact
necessary to guarantee that the computation that led to a `Nil` outcome is not
repeated in the future.

Because cascades are close cousins of lists, they are easy to work with.
Constructing a cascade for a tree-like data structure is straightforward,
whereas directly constructing a stateful iterator would be more involved.

## Back to the problem

Now, can we use some kind of visitor to turn a tree of type `'a sometree`
into a cascade of type `'a cascade`?

At first sight, this does not seem very easy, because of two issues:

 * a visitor usually traverses a tree in an eager manner, whereas we need the
   traversal to make progress only as cascade elements are demanded;

 * a visitor performs a bottom-up computation, without a left-to-right bias
   (assuming mutable state is not used), whereas a cascade enumerates elements
   in a left-to-right manner. (Or in a right-to-left manner. As will be
   apparent below, both directions are possible.)

The trick is to use another intermediate step. Instead of turning a tree
directly into a cascade, we first transform it into a generic tree-like
structure: a *delayed tree*. The first issue above is solved because, by
introducing delays into the new tree, we allow its construction to be carried
out on demand. The second issue is solved because this tree-to-tree
transformation can be carried out in a purely bottom-up manner by a `reduce`
visitor. Then, finally, it is straightforward to transform a delayed tree into
a cascade.

POTTIER Francois's avatar
POTTIER Francois committed
136
## From Delayed Trees to Cascades and Iterators
137 138 139 140 141 142 143 144 145 146 147 148 149

A delayed tree contains ordinary nodes of arity 0, 1, and 2. Furthermore,
it contains `DTDelay` nodes, of arity 1, whose child is delayed, that is,
computed only on demand.

```
type 'a delayed_tree =
  | DTZero
  | DTOne of 'a
  | DTTwo of 'a delayed_tree * 'a delayed_tree
  | DTDelay of (unit -> 'a delayed_tree)
```

150 151 152 153 154 155
A delayed tree is converted to a cascade as follows. As is often the case,
when building a cascade, one must take a continuation `k` as an argument, so
as to avoid naive and costly cascade concatenation operations. Thus, the
specification of the function call `delayed_tree_to_cascade dt k` is to
construct a cascade whose elements are the elements of the delayed tree `dt`
(listed from left to right), followed with the elements of the cascade `k`.
156 157 158 159 160 161 162 163 164 165 166

```
let rec delayed_tree_to_cascade (dt : 'a delayed_tree) (k : 'a cascade)
: 'a cascade =
  fun () -> delayed_tree_to_head dt k

and delayed_tree_to_head (dt : 'a delayed_tree) (k : 'a cascade) : 'a head =
  match dt with
  | DTZero ->
      force k
  | DTOne x ->
167
      Cons (x, k)
168
  | DTTwo (dt1, dt2) ->
169
      delayed_tree_to_head dt1 (delayed_tree_to_cascade dt2 k)
170
  | DTDelay dt ->
171
      delayed_tree_to_head (force dt) k
172 173 174 175 176 177 178 179

let delayed_tree_to_cascade (dt : 'a delayed_tree) : 'a cascade =
  delayed_tree_to_cascade dt nil

let delayed_tree_to_iterator (dt : 'a delayed_tree) : 'a iterator =
  cascade_to_iterator (delayed_tree_to_cascade dt)
```

180 181 182
In the above code, we have chosen to perform a left-to-right traversal of
the delayed tree, but could just as well have chosen a right-to-left traversal.

183 184 185
It is possible to make the delayed tree data structure implicit in the code;
this is explained [further on](#variant-getting-rid-of-explicit-delayed-trees).

186
## Constructing Delayed Trees
187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245

The type `'a delay` is a synonym for `'a`. We will use it as a decoration, in
a type definition, to indicate that a call to the method `visit_delay` is
desired.

```
type 'a delay = 'a
```

We now set up four constructor functions or constructor methods, which
construct delayed trees, and which we will use in a `reduce` visitor.

Delayed trees form a monoid, in the sense that we concatenate them using
`DTTwo`, and the neutral element is `DTZero`. We package these two data
constructors in the methods `zero` and `plus`, which are automatically called
in an automatically-generated `reduce` visitor.

The visitor method `visit_delay` delays the visit of a subtree by constructing
and returning a `DTDelay` node, which carries a delayed recursive call to a
visitor.

The visitor function `yield` will be invoked at elements of type `'a`. It
constructs a one-element delayed tree.

```
class ['self] delayed_tree_monoid = object (_ : 'self)

  method zero =
    DTZero

  method plus s1 s2 =
    match s1, s2 with
    | DTZero, s
    | s, DTZero ->
        s
    | _, _ ->
        DTTwo (s1, s2)

  method visit_delay: 'env 'a .
    ('env -> 'a -> 'b delayed_tree) ->
    'env -> 'a delay -> 'b delayed_tree
  = fun visit_'a env x ->
      DTDelay (fun () -> visit_'a env x)

end

let yield _env x =
  DTOne x
```

In the method `plus`, above, treating `DTZero` specially is not mandatory. It
is an optimization, which helps allocate fewer nodes.

## Generating a Visitor

It is now time to generate a `reduce` visitor for the type `'a sometree`.
This is the only part of the code which is specific of `sometree`.
Everything else is generic.

POTTIER Francois's avatar
POTTIER Francois committed
246 247 248 249 250 251
We must insert *delays* into the structure of the type `'a sometree` so as to
indicate where `visit_delay` should be called and (therefore) where `DTDelay`
nodes should be allocated. To do this, we write a copy of the definition of
the type `'a sometree`, with extra uses of `delay` in it. The new type is
actually considered equal to `'a sometree` by OCaml. Its role is purely to
carry a `@@deriving visitors` annotation.
252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268

In the data constructor `Node`, the left-hand `delay` is in fact superfluous.
With or without it, our iterators will eagerly descend along the leftmost
branch of a tree, anyway.

```
type 'a mytree = 'a sometree =
  | Leaf
  | Node of 'a mytree delay * 'a * 'a mytree delay

and 'a mytree_delay =
  'a mytree delay

[@@deriving visitors { variety = "reduce"; polymorphic = true;
                       concrete = true; ancestors = ["delayed_tree_monoid"] }]
```

POTTIER Francois's avatar
POTTIER Francois committed
269
This approach is pleasant insofar as one controls exactly where delays are
POTTIER Francois's avatar
POTTIER Francois committed
270 271
inserted. However, it requires copying the type definition, which may be
unpleasant. Another approach is described
POTTIER Francois's avatar
POTTIER Francois committed
272
[further on](#variant-avoiding-duplication-of-the-type-definition).
POTTIER Francois's avatar
POTTIER Francois committed
273 274 275

## Using the Generated Visitor

276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339
For demonstration purposes, let us make our visitor verbose. In production,
one should remove `verbose_reduce` and use `reduce` instead.

```
class ['self] verbose_reduce = object (_ : 'self)
  inherit [_] reduce as super
  method! visit_Leaf visit_'a env =
    Printf.printf "Visiting a leaf.\n%!";
    super#visit_Leaf visit_'a env
  method! visit_Node visit_'a env t1 x t2 =
    Printf.printf "Visiting a node.\n%!";
    super#visit_Node visit_'a env t1 x t2
end
```

The problem is solved! There remains to write a couple lines of glue code:

```
let sometree_to_delayed_tree (t : 'a sometree) =
  new verbose_reduce # visit_mytree_delay yield () t

let sometree_to_iterator (t : 'a sometree) : 'a iterator =
  delayed_tree_to_iterator (sometree_to_delayed_tree t)
```

We use `visit_mytree_delay` above, even though `visit_mytree` would work just
as well, so as to ensure that we get a delayed tree whose root is a `DTDelay`
node. Thus, absolutely no work is performed when the iterator is created;
iteration begins only when the first element is demanded.

## Demo

A little demo helps see what is going on.

```
let t : int sometree =
  Node (Node (Leaf, 1, Leaf), 2, Node (Leaf, 3, Leaf))

let i : int iterator =
  sometree_to_iterator t
```

Here is a transcript of an OCaml toplevel session:

```
  # i();;
  Visiting a node.
  Visiting a node.
  Visiting a leaf.
  - : int option = Some 1
  # i();;
  Visiting a leaf.
  - : int option = Some 2
  # i();;
  Visiting a node.
  Visiting a leaf.
  - : int option = Some 3
  # i();;
  Visiting a leaf.
  - : int option = None
  # i();;
  - : int option = None
```

POTTIER Francois's avatar
POTTIER Francois committed
340 341
## Variant: Avoiding Duplication of the Type Definition

342 343
Earlier, we have generated a visitor for the existing type `'a sometree` in *a
posteriori* style. We have manually created an isomorphic copy of the type `'a
POTTIER Francois's avatar
POTTIER Francois committed
344 345 346 347
sometree`, which we have named `'a mytree`, and have annotated this copy with
`[@@deriving visitors { ... }]`. Furthermore, we have taken this opportunity
to insert `delay` type constructors into the type, so as to influence the
generated visitor.
POTTIER Francois's avatar
POTTIER Francois committed
348

POTTIER Francois's avatar
POTTIER Francois committed
349
This style is relatively pleasant because it is declarative and lets us
POTTIER Francois's avatar
POTTIER Francois committed
350
control exactly where delays are inserted. However, it requires duplicating
POTTIER Francois's avatar
POTTIER Francois committed
351 352 353 354
the definition of the type `'a sometree`, which may be unpleasant (if the
definition is large) or impossible (if the definition is hidden behind an
abstraction barrier).

355
Another approach is to generate a visitor in *a priori* style. When the type
POTTIER Francois's avatar
POTTIER Francois committed
356 357 358 359 360 361 362 363 364 365 366 367 368
`'a sometree` is first defined, a `reduce` visitor can be immediately
generated for it, as follows:

```
type 'a sometree =
  | Leaf
  | Node of 'a sometree * 'a * 'a sometree

[@@deriving visitors { variety = "reduce"; polymorphic = true;
                       name = "sometree_reduce" }]
```

At this point, we pretend that we do not know yet what this visitor will be
POTTIER Francois's avatar
POTTIER Francois committed
369 370
used for, so we do not annotate the type definition with `delay`, and do not
use `delayed_tree_monoid` as a base class. We get a visitor class, named
POTTIER Francois's avatar
POTTIER Francois committed
371 372 373 374 375 376 377 378 379 380 381 382 383 384 385
`sometree_reduce`. This class has two virtual methods, `zero` and `plus`.

Then, we create a subclass, named `reduce`, which we tailor to our needs.
We mix the generated class `sometree_reduce` with the class `delayed_tree_monoid`,
and insert a delay at every tree node by overriding the method `visit_sometree`:

```
class ['self] reduce = object (self : 'self)
  inherit [_] sometree_reduce as super
  inherit [_] delayed_tree_monoid
  method! visit_sometree visit_'a env t =
    self#visit_delay (super#visit_sometree visit_'a) env t
end
```

POTTIER Francois's avatar
POTTIER Francois committed
386 387
The rest of the code is unchanged (except the method `visit_mytree_delay` no
longer exists; one calls `visit_sometree` instead).
POTTIER Francois's avatar
POTTIER Francois committed
388

389 390
## Variant: Getting Rid of Explicit Delayed Trees

POTTIER Francois's avatar
POTTIER Francois committed
391
I like to present delayed trees as an explicit data structure, because this
392 393 394 395
helps understand what is going on. However, if desired, it is possible to
hide them by *refunctionalization* (the opposite of *defunctionalization*).

Above, the function `delayed_tree_to_cascade` was written with the help of an
POTTIER Francois's avatar
POTTIER Francois committed
396
auxiliary function, `delayed_tree_to_head`. We could also have written it
POTTIER Francois's avatar
POTTIER Francois committed
397
directly, as follows:
398 399 400 401 402 403 404 405

```
let rec delayed_tree_to_cascade (dt : 'a delayed_tree) (k : 'a cascade)
: 'a cascade =
  match dt with
  | DTZero ->
      k
  | DTOne x ->
POTTIER Francois's avatar
POTTIER Francois committed
406
      cons x k
407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439
  | DTTwo (dt1, dt2) ->
      delayed_tree_to_cascade dt1 (delayed_tree_to_cascade dt2 k)
  | DTDelay dt ->
      fun () -> delayed_tree_to_cascade (force dt) k ()
```

In this form, the only function that is ever applied to a delayed tree is
`delayed_tree_to_cascade`. Therefore, wherever we construct a delayed tree
`t`, we could instead directly build a closure whose behavior is equivalent to
`delayed_tree_to_cascade t`. This is *refunctionalization*.

The data structure of delayed trees seems to disappears.
The type `'a delayed_tree` can be defined as a synonym
for a `'a cascade -> 'a cascade`. I usually refer to
this type as `'a producer`: it is the type of a function
that concatenates elements in front of the cascade that it
receives as an argument.

```
type 'a producer =
  'a cascade -> 'a cascade

type 'a delayed_tree =
  'a producer
```

The four data constructors are defined as follows:

```
let _DTZero k =
  k

let _DTOne x k =
POTTIER Francois's avatar
POTTIER Francois committed
440
  cons x k
441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460

let _DTTwo dt1 dt2 k =
  dt1 (dt2 k)

let _DTDelay dt k =
  fun () -> force dt k ()
```

The reader can check that the types of these data constructors are the
same as in the previous version of the code. For instance, `_DTDelay`
has type `(unit -> 'a delayed_tree) -> 'a delayed_tree`.

The root function `delayed_tree_to_cascade` (without a continuation argument)
is defined as follows:

```
let delayed_tree_to_cascade (dt : 'a delayed_tree) : 'a cascade =
  dt nil
```

POTTIER Francois's avatar
POTTIER Francois committed
461
The delayed tree monoid uses the new versions of the four data constructors:
462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488

```
class ['self] delayed_tree_monoid = object (_ : 'self)

  method zero =
    _DTZero

  method plus =
    _DTTwo

  method visit_delay: 'env 'a .
    ('env -> 'a -> 'b delayed_tree) ->
    'env -> 'a delay -> 'b delayed_tree
  = fun visit_'a env x ->
      _DTDelay (fun () -> visit_'a env x)

end

let yield _env x =
  _DTOne x
```

The four functions `_DTZero`, `_DTOne`, `_DTTwo`, and `_DTDelay` could be
inlined, if desired, so as to make the above code seem even more concise.

The rest of the code is unchanged.

489 490 491 492 493
## Acknowledgements

KC Sivaramakrishnan asked whether a visitor can be used to construct an
iterator. Gabriel Scherer pointed out that the delayed tree data structure can
be hidden by refunctionalization.