UTF-8-invalid byte sequences in produced source-code
So, this is a weird one. I'm not sure if this even falls under something you'll want to fix, because apparently it hasn't been causing problems in the OCaml ecosystem.
However: with some inputs, Menhir generates an output automaton whose source-code contains invalid UTF-8 sequences. If those sequences aren't preserved byte-for-byte through every transformation of the source-code, the parser ceases to function. (Probably obvious by this point: various parts of my tooling pipeline don't preserve invalid UTF-8 while transforming the source-code ... yeah, it's been a fun week tracking this one down.)
An example of such an input is here (you'll need the adjacent parserUtils.mly to build it, if you're trying to reproduce); here's how isutf8 opines on the output thereof:
$ isutf8 src/parserAutomaton.generated.ml
src/parserAutomaton.generated.ml: line 123, char 135, byte 2865: After a first byte between C2 and DF, expecting a 2nd byte between 80 and BF
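For anyone without isutf8 at hand, roughly the same yes/no check can be sketched in JavaScript. This is only a minimal sketch: `isValidUtf8` is a hypothetical helper name, and it relies on the standard `TextDecoder` `fatal` option, which makes decoding throw instead of silently substituting a replacement character.

```javascript
// Hypothetical helper: returns true when `bytes` (a Uint8Array or Node Buffer)
// is well-formed UTF-8, and false otherwise.
function isValidUtf8(bytes) {
  try {
    // With `fatal: true`, decode() throws a TypeError on the first
    // malformed sequence instead of emitting U+FFFD.
    new TextDecoder("utf-8", { fatal: true }).decode(bytes);
    return true;
  } catch (err) {
    return false;
  }
}

// 0xD0 followed by a backslash (0x5C): the same shape isutf8 complains about.
console.log(isValidUtf8(Uint8Array.from([0xd0, 0x5c])));
// 0xD0 followed by a proper continuation byte (0x90) is a valid two-byte sequence.
console.log(isValidUtf8(Uint8Array.from([0xd0, 0x90])));
```

In Node, passing the result of reading the generated file as a Buffer should work the same way, since a Buffer is a Uint8Array.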
Probably not relevant, but here's the one-byte havoc my (written-in-JavaScript, and thus pretty UTF-8-strict) intermediate tooling wreaks thereupon:
@|-2862,9 +2862,11 ============================================================
|5c \
|30 0
|30 0
|31 1
-|d0 .
+|ef .
+|bf .
+|bd .
|5c \
|30 0
|31 1
|36 6
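The diff above can be reproduced in isolation. The following is a sketch under an assumption, not a description of my actual pipeline: it assumes the tooling decodes the file as UTF-8 leniently (the default behaviour of the standard `TextDecoder`) and later re-encodes it. The nine input bytes are copied from the diff; the `hex` helper is just for display.

```javascript
// The nine bytes from the "-" side of the diff: the ASCII text \001,
// one raw 0xD0 byte, then the ASCII text \016.
const original = Uint8Array.from([
  0x5c, 0x30, 0x30, 0x31, // \001
  0xd0,                   // lead byte that is NOT followed by 0x80..0xBF
  0x5c, 0x30, 0x31, 0x36, // \016
]);

// Assumption: lenient decoding, which substitutes U+FFFD for the bad byte.
const text = new TextDecoder("utf-8").decode(original);

// Re-encoding writes U+FFFD out as the three bytes EF BF BD.
const roundTripped = new TextEncoder().encode(text);

// Display helper: format bytes as space-separated lowercase hex.
const hex = (bytes) =>
  Array.from(bytes, (b) => b.toString(16).padStart(2, "0")).join(" ");

console.log(hex(original));     // 5c 30 30 31 d0 5c 30 31 36
console.log(hex(roundTripped)); // 5c 30 30 31 ef bf bd 5c 30 31 36
```

The single byte d0 becomes the three bytes ef bf bd, which matches the 9-to-11 byte change in the hunk header.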
I'm not sure how these byte-sequences are used internally by the generated parser; but, again, probably needless to say, the parser stops accepting any input as valid once these bytes have been “corrected” by another tool!
Is this something that's “fix”-able? Or is such unencoded source-code content central to the operation of Menhir's automaton, somehow? (=