How to clear non-utf characters while reading a utf-8 file in Perl?

Question

I am parsing a very large log file with Perl. The code is:

open(my $input_handle, '<:encoding(UTF-8)', $input_file)
    or die "Cannot open $input_file: $!";

while (<$input_handle>) {
    ...
}
close($input_handle);

However, sometimes the log file contains faulty characters, and I get the following message:

utf8 "\xD0" does not map to Unicode at log_parser.pl line 32, <$input_handle> line 10920.

I am aware of the characters and I would just like to ignore them without the log message flooding my (Windows!) build server logs. I tried no warnings 'utf8'; but it did not help.

How can I suppress the message?

I guess I did not make myself clear. The original log file is created by a black-box tool and I cannot help the characters being there. But when my script is run (in a CI setting), it (my script) clutters the build server logs with its own error message. So the error message is what I want to get rid of; I cannot influence the original log. Is this better? :) — ynka, Feb 23 '15 at 19:06
No. I am ok with one of two solutions (one already given below, I think): a) fix the character inside my script and process it correctly b) remove the error message (from my script's output = build server logs). — ynka, Feb 23 '15 at 19:16

score 3 · Accepted Answer · edited May 23 '17 at 12:06

You could do the decoding yourself instead of using the :encoding layer. By default, Encode's decode and decode_utf8 replace each malformed byte sequence with U+FFFD (the Unicode replacement character) instead of warning.

$ perl -e'
   use Encode qw( decode_utf8 );
   $bytes = "\xD0 \x92 \xD0\x92\n";
   $text = decode_utf8($bytes);
   printf("U+%v04X\n", $text);
'
U+FFFD.0020.FFFD.0020.0412.000A
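
Applied to the read loop from the question, this could look roughly like the sketch below. It is only an illustration under a few assumptions: clean_line is a hypothetical helper name, the in-memory $sample stands in for the real log file, and stripping U+FFFD afterwards is an optional extra step for the "just ignore them" case.

```perl
use strict;
use warnings;
use Encode qw( decode );

# Hypothetical helper: decode one line of raw bytes as UTF-8 and strip
# whatever could not be decoded.
sub clean_line {
    my ($bytes) = @_;
    # No CHECK argument: malformed bytes become U+FFFD without a warning.
    my $text = decode('UTF-8', $bytes);
    # Optional: drop the replacement characters entirely.
    $text =~ s/\x{FFFD}//g;
    return $text;
}

# Stand-in for the real log: an in-memory file holding the bytes from
# the example above. For a real file, pass the path instead of \$sample.
my $sample = "\xD0 \x92 \xD0\x92\n";
open(my $input_handle, '<:raw', \$sample)
    or die "Cannot open sample: $!";

my @lines;
while (my $bytes = <$input_handle>) {
    push @lines, clean_line($bytes);
}
close($input_handle);

printf("U+%v04X\n", $_) for @lines;   # U+0020.0020.0412.000A
```

Opening with :raw keeps the :encoding layer (and its warning) out of the picture. Note that the substitution also removes any U+FFFD that was validly encoded in the input, so leave it out if those need to survive.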

If the file is a mix of UTF-8, iso-8859-1 and cp1252, it may be possible to fix the file rather than simply silencing the errors, as detailed here.
