### Robust schemes by generalization without a generic rank

A delicate aspect of Inferno's implementation of generalization is that its notion of "scheme" is not robust, in the sense that: - Computing the set of structureless variables of a polymorphic scheme may give an incorrect result if performed too late (after other levels have been generalized), it must be performed right after generalization. - Similarly, the implementation of instantiation is fragile: if a scheme comes from level N+1, instantiation constraints at level N work as expected, but trying to instantiate the scheme at level N-1 or below may give incorrect results. Neither of those aspects are problematic for inference in a core ML language: consider `let x = t in u`, all instantiation constraints for `x` (whose scheme is the type of `t` at some level N+1) come from syntactic occurrences of `x` in `u`, at level N. However it could be problematic when considering other language features: - Frozen constraints may result in partially-frozen polymorphic schemes (some parts of a polymorphic schemes are not generic, as their type still needs to be determined by unfreezing some frozen constraints) for which computing the set of quantifiers right away is impossible. We need to be able to revisit the scheme later, after all parts are unfrozen, to compute the quantified variables. - First-class polymorphism lets you export a polymorphic scheme outside the syntactic scope of the let-body, so that a scheme from level N+2 may be instantiated at level N or below. Consider for example: let foo (y : 'b) = (* level N+1 *) {m : 'a . 'a -> 'a * 'b = (* level N+2 *) fun x -> (x, y) } in (* level N *) (foo t).m u To solve this problem, I propose to keep the rank of generic variables. Currently generalization is performed by modifying generalized variables to give them an "infinite" level `Generalization.generic`. Instead, we use a separate boolean flag `mutable generic: bool` that starts `false` and is set to `true` during generalization. A scheme is a pair of a root variable, and the rank at which the scheme was generalized. To instantiate a type scheme, we simply copy the transitive children of the root variables that are generic, and share the non-generic variables. Generic variables keep the level they come from, and in particular we define the quantified variables of a scheme as the set of generic structureless variables *at the level of the scheme*, rather than some lower level. In the example above, the generic type `'a -> 'a * 'b)` has `'a` at level N+2 and `'b` at level N+1, so type scheme of `m` (coming from level N+2) can be correctly computed as just `'a. 'a -> 'a * 'b`. (To embed a variable/type at level N as a trivial "monomorphic scheme" with no polymorphism, we simply build a scheme of level N+1 on this variable.) Remark on level (non)confusion ------------------------------ There is a subtlety due to the use of a linear array of generalization regions, rather than a tree of regions: two generalization regions may have the same integer 'rank' yet be distinct. In particular, it would be very wrong (unsound) to start unifying flexible (non-generic) variables living in different generalization regions at the same level; let us call this a "level confusion" problem. With our proposed approach to generalization, it is possible to have both *generic* variables at some level L and non-generic variables at level Lor above, and to instantiate generic variables from L at some equal or higher level. Consider for example: (* level N *) let id = (* level N+1 *) fun x -> x in let f x = (* another region at level N+1 *) let g y = ... id (x, y) ... in ... in ... The instantiation constraint on `id x` then involves copying generic nodes of level N+1 into flexible nodes at level N+2. This sounds a bit fishy at first; it sounds like we could have a level confusion problem. This example is in fact perfectly fine: irrelevant after instantiation, as the copies are made at the level of the instance, so the generic variables of level N are copied into flexible variables at level N+2 (not flexible variables of level N or anything like that); besides, both the scheme and the instantiation site may contain flexible variables of level N or below, but in this case they come from the exact same generalization regions, there is no level confusion. In general, we instantiate a scheme coming from generalization region RA at level A into a generalization region RB at level B (which may be higher or lower than A. Consider the region RC at level C (C < A, C <= B) that is the nearest common ancestor of RA and RB in the tree of generalization regions. If RB is an active descendant of RC, all siblings of RC that are not ancestors of RC must have been generalized. In particular, all regions on the path between RC and RA have been generalized, and their variables in the scheme are now generic. Thus a flexible variable reachable from the scheme must have level <= C. The ancestors of RA and RB at level <= C are below their nearest common ancestor RC, so they are exactly the same region: there is no level confusion for flexible variables.