
I have the following sub that opens a text file and attempts to ensure its encoding is one of UTF-8, ISO-8859-15 or ASCII.

The problem I have with it is different behaviours in interactive vs. non-interactive use.

  • when I run interactively with a file that contains a UTF-8 line, $decoder is, as expected, a reference object whose name returns utf8 for that line.

  • non-interactively (it runs as part of a Subversion commit hook), guess_encoding instead returns a plain scalar string: "utf8 or iso-8859-15" for the utf8 check line, and "iso-8859-15 or utf8" for the other two lines.

I can't for the life of me work out where the difference in behaviour is coming from. If I force the encoding in the open to, say, <:encoding(utf8), it accepts every line as UTF-8 without question.

The problem is I can't assume that every file it receives will be UTF-8, so I don't want to force the encoding as a work-around. Another potential workaround is to parse the scalar text, but that just seems messy, especially when it seems to work correctly in an interactive context.

From the shell, I've tried unsetting $LANG (non-interactively it isn't set, nor are any of the LC_ variables); however, the interactive version still runs correctly.

The commented-out line that reports $Encode::Guess::NoUTFAutoGuess prints 0 in both interactive and non-interactive use when it is uncommented.

Ultimately, the one thing we're trying to prevent is having UTF-16 or other wide-char encodings in our repository (as some of our tooling doesn't play well with it): I thought that looking for a white-list of encodings is an easier job than looking for a black-list of encodings.

use feature 'say';    # needed for say()
use Encode::Guess;    # provides Encode::Guess::guess_encoding

sub checkEncoding
{
    my ($file) = @_;

    my ($b1, $b2, $b3);
    my $encoding = "";
    my $retval = 1;
    my $line = 0;

    say("Checking encoding of $file");
    #say($Encode::Guess::NoUTFAutoGuess);
    # Lexical filehandle; fail loudly if the file cannot be opened.
    open(my $fh, "<", $file) or die "Cannot open $file: $!";
    while (<$fh>) {
        chomp($_);
        $line++;

        # guess_encoding returns an encoding object on success,
        # or a plain string describing why it could not decide.
        my $decoder = Encode::Guess::guess_encoding($_, 'utf8');
        say("A: $decoder");
        $decoder = Encode::Guess::guess_encoding($_, 'iso-8859-15') unless ref $decoder;
        say("B: $decoder");
        $decoder = Encode::Guess::guess_encoding($_, 'ascii') unless ref $decoder;
        say("C: $decoder");

        if (ref $decoder) {
            $encoding = $decoder->name;
        } else {
            say "Mis-identified encoding '$decoder' on line $line: [$_]";
            my $z = unpack('H*', $_);    # hex dump of the offending line
            say $z;
            $encoding = $decoder;
            $retval = 0;
        }

        last if ($retval == 0);
    }
    close $fh;

    return $retval;
}
Chris J
  • I thought you were supposed to feed Encode::Guess the entire data set (file) to get the most accurate results? To me it seems you're trying to guess the encoding for each line separately. Even for a UTF-8 file, some lines may not have byte combinations that make it look like UTF-8, so for those lines, the guessing will also consider ASCII or ISO-8859-15. – Silvar Mar 08 '19 at 15:45
  • I also think you're supposed to use `<:raw` in the `open()`. – Silvar Mar 08 '19 at 15:51
  • From the known knowns perspective, ISO-8859-1=Yes. No byte value or sequence of values is incompatible with ISO-8859-1. – Tom Blodget Mar 08 '19 at 16:21
  • Wait, in one place you say `ISO-8859-1`, but everywhere else you say `-15`. Is `-1` a typo? – ikegami Mar 08 '19 at 21:05
  • Hi all, thanks for the comments - changes made to slurp the whole file and give guess_encoding() that. However, the central problem remains: when run interactively, `$decoder` ends up being a reference type; when run via the SVN hooks non-interactively, `$decoder` ends up being a scalar. I want to understand why that should be different. – Chris J Mar 11 '19 at 12:14
  • @Silvar - added :raw, but that doesn't change things either: I still have different behaviours in interactive vs. non-interactive. – Chris J Mar 11 '19 at 12:22

1 Answer


No need to guess. For the specific options of UTF-8, ISO-8859-1 and US-ASCII, you can use Encoding::FixLatin's fix_latin. It's virtually guaranteed to succeed.

That said, I think the use of ISO-8859-1 in the OP is a typo for ISO-8859-15.

The method used by fix_latin would work just as well for ISO-8859-15 as it does for ISO-8859-1. It's simply a question of replacing _init_byte_map with the following:

sub _init_byte_map {
    foreach my $i (0x80..0xFF) {
        my $byte = chr($i);
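        # from_to() converts in place and returns a length, so decode and re-encode instead.
        my $utf8 = Encode::encode('UTF-8', Encode::decode('iso-8859-15', $byte));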
        $byte_map->{$byte} = $utf8;
    }
}

Alternatively, if you're willing to assume the data is all of one encoding or another (rather than a mix), you could also use the following approach:

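use Encode qw(decode);

# $bytes holds the raw, undecoded file contents.
my $text;
if (!eval {
   # Strict UTF-8 decode: croaks on malformed input and leaves $bytes untouched.
   $text = decode("UTF-8", $bytes, Encode::FB_CROAK|Encode::LEAVE_SRC);
   1  # No exception
}) {
   # Not valid UTF-8, so treat it as ISO-8859-15.
   $text = decode("ISO-8859-15", $bytes);
}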
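If the goal is only to accept or reject a file rather than to convert it, the same strict decode can serve as a yes/no check. The following is a minimal sketch under that assumption; `is_valid_utf8` is a hypothetical helper name, not part of Encode, and the sample strings are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: returns 1 if $bytes is well-formed UTF-8
# (pure ASCII included), 0 otherwise. $bytes is left unmodified.
sub is_valid_utf8 {
    my ($bytes) = @_;
    my $ok = eval {
        decode('UTF-8', $bytes, Encode::FB_CROAK | Encode::LEAVE_SRC);
        1;    # no exception, so the decode succeeded
    };
    return $ok ? 1 : 0;
}

# Illustrative inputs (raw bytes, as read with :raw).
my %samples = (
    'UTF-8 text'          => "caf\xC3\xA9 au lait",
    'ISO-8859-15 text'    => "caf\xE9 au lait",
    'UTF-16LE with a BOM' => "\xFF\xFEc\x00a\x00",
);

for my $label (sort keys %samples) {
    my $verdict = is_valid_utf8($samples{$label}) ? 'valid UTF-8' : 'not UTF-8';
    print "$label: $verdict\n";
}
```

Note that a fallback to ISO-8859-15 cannot reject anything by itself, because, as far as I know, every byte value decodes under it. Rejecting UTF-16 would therefore need an extra test, such as looking for a BOM or NUL bytes.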

Keep in mind that US-ASCII is a proper subset of both UTF-8 and ISO-8859-15, so it doesn't need to be handled specially.
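As for the two return shapes of guess_encoding described in the question, here is a small sketch that reproduces both from the same data. It rests on my reading of the Encode::Guess source, so treat it as an assumption: a string that already carries Perl's internal UTF-8 flag is reported as utf8 straight away (unless `$Encode::Guess::NoUTFAutoGuess` is set), while raw bytes are tested against every suspect. Whether something in the interactive environment decodes the input before it reaches guess_encoding is a guess on my part, not a confirmed diagnosis.

```perl
use strict;
use warnings;
use Encode ();
use Encode::Guess ();

# UTF-8 bytes for "café", as they would arrive from a :raw read.
my $bytes = "caf\xC3\xA9";

# Raw bytes: both suspects can decode them, so the result is a plain
# string naming the candidates, not an encoding object.
my $from_bytes = Encode::Guess::guess_encoding($bytes, 'utf8', 'iso-8859-15');
print "bytes: ", (ref $from_bytes ? $from_bytes->name : "string '$from_bytes'"), "\n";

# Decoded characters: the string now carries the UTF-8 flag, and the
# result is an encoding object.
my $chars = Encode::decode('UTF-8', $bytes);
my $from_chars = Encode::Guess::guess_encoding($chars, 'utf8', 'iso-8859-15');
print "chars: ", (ref $from_chars ? $from_chars->name : "string '$from_chars'"), "\n";
```

If this is what is happening, comparing `Encode::is_utf8($_)` for the same input line in both contexts should show the difference.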

ikegami
  • Added to answer. – ikegami Mar 08 '19 at 21:20
  • Hiya, ultimately I'm not looking to do a `decode`, but if that's going to be the most reliable way to detect encoding I'll run with it (the intended behaviour is to throw an error on an encoding that's not acceptable, not convert between encodings). My primary concern is the different behaviour in guess_encoding in different contexts (i.e., interactive vs. non-interactive). – Chris J Mar 11 '19 at 12:17