UTF-8-invalid byte sequences in produced source-code
So, this is a weird one. I'm not sure if this even falls under something you'll want to fix, because apparently it hasn't been causing problems in the OCaml ecosystem.
However: with some inputs, Menhir generates an output automaton whose source-code contains invalid UTF-8 sequences. If those sequences aren't preserved byte-for-byte through every transformation of the source-code, the parser ceases to function. (As is probably obvious by this point, various parts of my tooling pipeline don't preserve invalid UTF-8 while transforming the source-code ... yeah, it's been a fun week tracking this one down.)
    $ isutf8 src/parserAutomaton.generated.ml
    src/parserAutomaton.generated.ml: line 123, char 135, byte 2865: After a first byte between C2 and DF, expecting a 2nd byte between 80 and BF
    @|-2862,9 +2862,11 ============================================================
     |5c \
     |30 0
     |30 0
     |31 1
    -|d0 .
    +|ef .
    +|bf .
    +|bd .
     |5c \
     |30 0
     |31 1
     |36 6
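For anyone who wants to reproduce the mangling without my pipeline, here is a minimal Python sketch. It assumes the offending tool does a lossy decode and re-encode, which is only my guess: the `ef bf bd` in the diff above is the UTF-8 encoding of U+FFFD, the replacement character. `SAMPLE` is just the eight bytes around the change in that diff, and the names `first_invalid_utf8` and `lossy_roundtrip` are made up for illustration.

```python
# Bytes taken from the diff above: the text `\001`, a raw 0xD0 byte, then `\016`.
SAMPLE = b"\\001\xd0\\016"


def first_invalid_utf8(data: bytes):
    """Return the 0-based offset of the first invalid UTF-8 byte, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        return err.start
    return None


def lossy_roundtrip(data: bytes) -> bytes:
    """Decode as UTF-8, replacing invalid bytes with U+FFFD, then re-encode."""
    return data.decode("utf-8", errors="replace").encode("utf-8")


if __name__ == "__main__":
    # Offset of the stray 0xD0 within SAMPLE.
    print(first_invalid_utf8(SAMPLE))
    # Hex dump of the bytes after the lossy round trip.
    print(lossy_roundtrip(SAMPLE).hex(" "))
```

The round trip turns the single `d0` byte into the three bytes `ef bf bd`, which matches the diff. After that, the original byte value cannot be recovered from the file.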
I'm not sure how these byte-sequences are used internally by the generated parser; but, again probably needless to say, the parser stops accepting any input as valid once these bytes get mangled by another tool!
Is this something that's “fix”-able? Or is such unencoded source-code content central to the operation of Menhir's automaton, somehow? (=