
I have a reasonably sized flat-file database of text documents, mostly saved in 8859 (ISO-8859-1) format, which have been collected through a web form (using Perl scripts). Until recently I was handling the common 1252 characters (curly quotes, apostrophes etc.) with a simple set of regexes:

$line =~ s/\x91/&#8216;/g; # smart apostrophe left
$line =~ s/\x92/&#8217;/g; # smart apostrophe right

... etc.

However, since I decided I ought to move to Unicode, and have converted all my scripts to read and write UTF-8 (which works a treat for all new material), the regexes for these (existing) 1252 characters no longer work, and my Perl HTML output literally emits the four characters '\x92', '\x93' etc. At least that's how it appears in a browser in UTF-8 mode; downloading the file (by FTP, not HTTP) and opening it in a text editor (TextPad), a single undefined character remains instead, and opening the output file in Firefox's default 8859 mode (no Content-Type header) renders the correct character.

The new utf8 pragmas at the start of the script are:

use CGI qw(-utf8);
use open IO => ':utf8';

I understand this is because UTF-8 mode makes those characters two bytes instead of one, and that this applies to characters in the 0x80 to 0xFF range; I have read the Wikibooks article on the subject, but I was none the wiser as to how to filter them. I know I ought ideally to re-save all the documents in UTF-8 (since the flat-file database now contains a mixture of 8859 and UTF-8), but I will need some kind of filter in the first place if I'm going to do that anyway.

And I could be wrong about the two-byte internal storage, since the article did seem to imply that Perl handles things very differently depending on the circumstances.

If anybody could provide me with a regex solution (or some other method) I would be very grateful. I have been tearing my hair out for weeks on this with various attempts and failed hacking. There are only about six 1252 characters that commonly need replacing, and with a filter method I could re-save the whole flippin' lot in UTF-8 and forget there ever was a 1252...

Beeblbrox
  • Oh... and I can't just go back to opening the files as 8859 and filtering, since the DB now contains both UTF-8 and 8859. Whoops. – Beeblbrox Oct 21 '11 at 10:28

4 Answers


Encoding::FixLatin was specifically written to help fix data broken in the same manner as yours.
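
For reference, a minimal sketch of how it might be used (the variable names here are illustrative, not from the question): fix_latin() takes a byte string that may mix UTF-8 with CP1252/Latin-1 and returns a decoded Perl character string.

use Encoding::FixLatin qw(fix_latin);

# $octets: raw bytes read from the flat file, possibly in mixed encodings
my $string = fix_latin($octets);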

ikegami

Ikegami already mentioned the Encoding::FixLatin module.

Another way to do it, if you know that each string will be either UTF-8 or CP1252, but not a mixture of both, is to read it as a binary string and do:

unless ( utf8::decode($string) ) {    # succeeds only if $string is entirely valid UTF-8
    require Encode;
    $string = Encode::decode(cp1252 => $string);    # otherwise treat the bytes as CP1252
}

Compared to Encoding::FixLatin, this has two small advantages: a slightly lower chance of misinterpreting CP1252 text as UTF-8 (because the entire string must be valid UTF-8) and the possibility of replacing CP1252 with some other fallback encoding. A corresponding disadvantage is that this code could fall back to CP1252 on strings that are not entirely valid UTF-8 for some other reason, such as because they were truncated in the middle of a multi-byte character.
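
Applied line by line to one of the mixed flat files, the approach might look like this (the file name and loop are illustrative, not part of the answer's code):

use strict;
use warnings;
use Encode ();

open my $fh, '<:raw', 'documents.db' or die "open: $!";
while ( my $line = <$fh> ) {
    # utf8::decode() succeeds (and decodes $line in place) only if the
    # bytes are entirely valid UTF-8; otherwise fall back to CP1252.
    unless ( utf8::decode($line) ) {
        $line = Encode::decode( cp1252 => $line );
    }
    # $line is now a decoded character string either way
}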

Ilmari Karonen
  • That's great, I *think* that may be the solution I need – it never occurred to me to decode on a line-by-line basis rather than slurping the whole file as one or the other. So this will leave valid UTF-8 strings alone and allow me to muck around with strings that contain non-UTF characters using the regexes as I used to? – Beeblbrox Oct 25 '11 at 12:01
  • ... and I didn't know about the FixLatin module, which seems to do exactly what I'm looking for, thanks again – Beeblbrox Oct 25 '11 at 12:02
  • Both of these solutions will (except for a small chance of charset misidentification) convert all input strings to Perl Unicode character strings (which may be internally represented as UTF-8, but you really shouldn't care about that), regardless of whether they were encoded in UTF-8 or CP1252. So you won't need to do any additional "regex mucking" on top of that. (It probably wouldn't do any harm even if you did, though, since those regexes will never match valid printable Unicode strings.) – Ilmari Karonen Oct 25 '11 at 12:35
  • Finally got round to attempting your custom line-by-line decoding (FixLatin didn't work as far as I could tell), which has now worked a treat, thanks ever so much; I can finally draw a line under the whole stupid business and bathe in UTF-8 joy. (I still have a handful of occasional odd characters popping up in a very few places, nothing to do with CP1252 it seems, but it's more than tolerable.) – Beeblbrox Dec 27 '11 at 12:44
  • To see the limitations of the two methods you mentioned, go [here](http://stackoverflow.com/q/28681864/589924). – ikegami Feb 23 '15 at 19:37

You could also use Encode.pm's support for fallback.

use Encode qw[decode];

my $octets = "\x91 Foo \xE2\x98\xBA \x92";    # CP1252 quotes around valid UTF-8

# Any byte that fails to decode as UTF-8 is handed to the fallback sub,
# which reinterprets it as Windows-1252 and returns the replacement.
my $string = decode('UTF-8', $octets, sub {
    my ($ordinal) = @_;
    return decode('Windows-1252', pack 'C', $ordinal);
});

printf "<%s>\n",
  join ' ', map { sprintf 'U+%.4X', ord $_ } split //, $string;

Output:

<U+2018 U+0020 U+0046 U+006F U+006F U+0020 U+263A U+0020 U+2019>
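
For context: when the third argument to decode() is a code reference, Encode calls it with the ordinal value of each byte that fails to decode and splices the sub's return value into the result, which is what makes the per-byte Windows-1252 fallback above possible.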
chansen

Did you recode the data files? If not, opening them as UTF-8 won't work. You can simply open them as

open $filehandle, '<:encoding(cp1252)', $filename or die ...;

and everything (tm) should work.
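
If you do want to recode a legacy file to UTF-8 in one pass, a minimal sketch (the file names are hypothetical) would be:

use strict;
use warnings;

open my $in,  '<:encoding(cp1252)', 'old.txt'      or die "open: $!";
open my $out, '>:encoding(UTF-8)',  'old.utf8.txt' or die "open: $!";
print {$out} $_ while <$in>;
close $out or die "close: $!";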

If you did recode, something seems to have gone wrong, and you need to analyze what it was and fix it. I recommend using hexdump to find out what is actually in a file. Text consoles and editors sometimes lie to you; hexdump never lies.

moritz
  • Hexdump shows 91, 92, 93 in the places where the 1252 characters are expected. Why the regex matches /\x91/, /\x92/ etc. fail in this case confuses me. I have noticed that my text editor reports newly created files containing the 1252 characters as ANSI, and those without as UTF-8 – I expected the Perl utf8 IO layer to force all my files to UTF-8. I cannot recode until I find a way to filter the 1252 characters out of the legacy 8859 files, since they are all mixed in with the new UTF-8 files. – Beeblbrox Oct 21 '11 at 12:57