UTF-8-invalid byte sequences in produced source-code

So, this is a weird one. I'm not sure if this even falls under something you'll want to fix, because apparently it hasn't been causing problems in the OCaml ecosystem.

However: with some inputs, Menhir generates an output automaton whose source-code contains invalid UTF-8 sequences. If any later transformation of the source-code fails to preserve those bytes exactly, the parser ceases to function. (As is probably obvious by this point, various parts of my tooling pipeline don't preserve invalid UTF-8 while transforming the source-code ... yeah, it's been a fun week tracking this one down. 😅)

An example of such an input is here (you'll need the adjacent parserUtils.mly to build it, if you're trying to reproduce); here's how isutf8 opines on the output thereof:

$ isutf8 src/parserAutomaton.generated.ml
src/parserAutomaton.generated.ml: line 123, char 135, byte 2865: After a first byte between C2 and DF, expecting a 2nd byte between 80 and BF
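
For anyone without isutf8 at hand, the same check can be sketched in OCaml itself. This is only a minimal sketch, assuming OCaml 4.14 or later (for `String.get_utf_8_uchar`); the name `first_invalid_utf8` is mine, not anything from Menhir or isutf8.

```ocaml
(* Hypothetical helper: return the 0-based offset of the first byte that
   does not start a valid UTF-8 sequence, or None if the string is clean.
   Requires OCaml >= 4.14. *)
let first_invalid_utf8 (s : string) : int option =
  let n = String.length s in
  let rec go i =
    if i >= n then None
    else
      let d = String.get_utf_8_uchar s i in
      if Uchar.utf_decode_is_valid d
      then go (i + Uchar.utf_decode_length d)
      else Some i
  in
  go 0
```

Applied to the bytes `\001`, a raw 0xD0, then `\016` (the run shown in the diff further down), it should point at the lone 0xD0.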

Probably not relevant, but here's the one-byte havoc my (written-in-JavaScript, and thus pretty UTF-8-strict) intermediate tooling wreaks thereupon:

@|-2862,9 +2862,11 ============================================================
 |5c \
 |30 0
 |30 0
 |31 1
-|d0 .
+|ef .
+|bf .
+|bd .
 |5c \
 |30 0
 |31 1
 |36 6
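
The substitution in that diff (a lone 0xD0 becoming EF BF BD, i.e. U+FFFD) is what a strict "decode, then re-encode" pass typically does. Here is a rough OCaml sketch of that behaviour, assuming OCaml 4.14 or later; `sanitize_utf8` is a made-up name standing in for whatever my JavaScript tooling actually does internally.

```ocaml
(* Hypothetical stand-in for a UTF-8-strict tool: decode the input and
   re-encode it, so every undecodable byte comes out as U+FFFD.
   Requires OCaml >= 4.14. *)
let sanitize_utf8 (s : string) : string =
  let n = String.length s in
  let buf = Buffer.create n in
  let rec go i =
    if i < n then begin
      let d = String.get_utf_8_uchar s i in
      (* For an invalid decode, utf_decode_uchar yields Uchar.rep (U+FFFD). *)
      Buffer.add_utf_8_uchar buf (Uchar.utf_decode_uchar d);
      go (i + Uchar.utf_decode_length d)
    end
  in
  go 0;
  Buffer.contents buf
```

Valid input passes through unchanged; only the undecodable byte is rewritten, which is why the damage is so easy to miss.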

I'm not sure how these byte-sequences are used internally by the generated parser; but, again probably needless to say, the parser stops accepting any input as valid after these bytes get mangled by another tool!

Is this something that's “fix”-able? Or is such unencoded source-code content central to the operation of Menhir's automaton, somehow? (=
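
To illustrate the kind of fix I have in mind: if the raw bytes live inside OCaml string literals, then writing every byte outside printable ASCII as a decimal `\ddd` escape would keep the generated file pure ASCII while denoting the same string. The sketch below is purely an assumption on my part; I don't know how Menhir actually emits its tables, and `escape_ascii_only` is a hypothetical name.

```ocaml
(* Hypothetical escaper: render the contents of a string literal using
   only printable ASCII. Quotes and backslashes are backslash-escaped;
   every other byte outside ' '..'~' becomes a three-digit decimal
   escape, which the OCaml lexer reads back as the same byte. *)
let escape_ascii_only (s : string) : string =
  let buf = Buffer.create (4 * String.length s) in
  String.iter
    (fun c ->
      match c with
      | '"' -> Buffer.add_string buf "\\\""
      | '\\' -> Buffer.add_string buf "\\\\"
      | ' ' .. '~' -> Buffer.add_char buf c
      | _ -> Buffer.add_string buf (Printf.sprintf "\\%03d" (Char.code c)))
    s;
  Buffer.contents buf
```

With something like this, the three bytes 0x01, 0xD0, 0x10 would be written as `\001\208\016` instead of leaving the 0xD0 raw between two escapes.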
