Hi,
I too have been having big problems with Unicode-encoded data in Spreadsheet::ParseExcel (Perl v5.8.4, module v2603).
Happily, I've been having some success with running a modified copy of the FmtDefault class as outlined in bugs 7376 and 168. In summary, my FmtDefault looks like this:
#------------------------------------------------------------------------------
# TextFmt (for Spreadsheet::ParseExcel::FmtDefault)
#------------------------------------------------------------------------------
#sub TextFmt($$;$) {
# my($oThis, $sTxt, $sCode) =@_;
# return $sTxt if((! defined($sCode)) || ($sCode eq '_native_'));
# return pack('C*', unpack('n*', $sTxt));
#}
sub TextFmt($$;$) {
    my($oThis, $sTxt, $sCode) = @_;
    return $sTxt if((! defined($sCode)) || ($sCode eq '_native_'));
    # Handle utf8 strings in newer perls.
    if ($] >= 5.008) {
        require Encode;
        return Encode::decode("UTF-16BE", $sTxt);
    }
    return pack('U*', unpack('n*', $sTxt));
    #return pack('C*', unpack('n*', $sTxt));
}
This section occurs somewhere around line 68 of the file.
With this change, $cell->Value returns one of (a) an ISO-8859-1 string (because that's what the cell contained and no upgrade or conversion was necessary), (b) a UTF-8 string, or (c) an ISO-8859-1 string preceded by a null byte. The third case is apparently a bug that arises when the source text contains a mix of fonts. I'll raise an error report on it later; you can probably ignore that possibility.
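To illustrate what the two branches of the patched TextFmt do, here is a small standalone sketch (my own test script, not part of the module). The sample bytes are an assumption: a made-up UTF-16BE string holding capital Theta followed by "A". The last variable shows what the original 'C*' version does with the same input.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical raw cell text in UTF-16BE: U+0398 (capital Theta), then U+0041 ("A").
my $raw = "\x03\x98\x00\x41";

# Branch used on Perl 5.8 and later.
my $decoded = decode("UTF-16BE", $raw);

# Fallback branch: one character per 16-bit big-endian unit.
# (Matches decode() only for text without surrogate pairs.)
my $packed = pack('U*', unpack('n*', $raw));

# Original code: keeps only the low byte of each 16-bit unit,
# so anything above U+00FF is mangled.
my $old;
{
    no warnings 'pack';    # 0x0398 does not fit into a single byte
    $old = pack('C*', unpack('n*', $raw));
}

# Helper (hypothetical name) that lists the code points of a string.
sub codepoints {
    my ($s) = @_;
    return join ' ', map { sprintf('U+%04X', ord($_)) } split(//, $s);
}

print "decoded: ", codepoints($decoded), "\n";   # decoded: U+0398 U+0041
print "packed:  ", codepoints($packed),  "\n";   # packed:  U+0398 U+0041
print "old:     ", codepoints($old),     "\n";
```

The first two lines should agree, while the third shows the Theta losing its high byte.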
For most applications on a sufficiently modern perl, you don't have to worry about the difference between UTF-8 strings and ISO-8859-1 strings: plain Latin-1 strings are upgraded where needed. You may, however, need to tell perl what's what before concatenating mixed Latin-1/UTF-8 strings (see http://www.ahinea.com/en/tech/perl-unicode-struggle.html for more).
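A minimal sketch of the concatenation issue, using made-up sample strings (the variable names are mine). It assumes the usual situation where one string holds raw, undecoded UTF-8 bytes and the other is a proper character string:

```perl
use strict;
use warnings;
use Encode qw(decode);

my $chars       = "\x{398}";         # character string: capital Theta
my $latin1      = "caf\xE9";         # "cafe" with e-acute, as ISO-8859-1 bytes
my $utf8_octets = "caf\xC3\xA9";     # the same word as raw UTF-8 bytes

# Latin-1 bytes are upgraded correctly when joined to a character string.
my $ok = $latin1 . $chars;

# Undecoded UTF-8 bytes are treated as Latin-1 too, so the e-acute
# turns into two separate characters.
my $wrong = $utf8_octets . $chars;

# Telling perl what the bytes are before concatenating fixes that.
my $right = decode("UTF-8", $utf8_octets) . $chars;

printf "ok=%d wrong=%d right=%d\n",
    length($ok), length($wrong), length($right);   # ok=5 wrong=6 right=5
```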
As for the conversion of UTF-8 to ISO-8859-7, I've never gotten Unicode::Map and FmtUnicode working correctly. Instead, I use one of two approaches. The first is 'use encoding' with the environment variable PERL_ENCODING set to the required encoding; STDIN and STDOUT are then filtered so that data is converted from/to that encoding as it is read/written. The second is 'use Encode qw(encode decode)' followed by explicit conversion from UTF-8 to the local character set, viz:
use Encode qw(encode decode);
# ...
# stuff to get values from spreadsheet
#my $value=$sheet->{Cells}[$row][$col]->Value;
my $value="\x{398}"; # or just use a fixed value for testing, in this case capital Theta
# convert Perl's internal string format into ISO-8859-7 (Latin/Greek)
my $converted=encode("iso-8859-7",$value);
print "<$value> = ";
print "<$converted>\n"; # don't concatenate two strings with different encodings!
And from running this I get:
$ export LC_ALL=el_GR.iso88597
$ perl test.pl > test.txt
$ od -ha test.txt
0000000 ce3c 3e98 3d20 3c20 3ec8 000a
< N can > sp = sp < H > nl nul
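To read that dump: od -h prints 16-bit words, so on a little-endian machine each pair of bytes appears swapped (ce3c is the bytes 3c ce), and od -a strips the high bit, which is why c8 shows up as H. The following sketch (my own check, not part of the test script above) prints the bytes directly so they can be compared with the dump:

```perl
use strict;
use warnings;
use Encode qw(encode);

my $value = "\x{398}";    # capital Theta, as in the test script

# Explicit conversions to the two encodings that show up in the dump.
my $greek = encode("iso-8859-7", $value);
my $utf8  = encode("UTF-8", $value);

# Helper (hypothetical name) that prints a byte string as hex.
sub hexbytes {
    my ($octets) = @_;
    return join ' ', map { sprintf('%02x', $_) } unpack('C*', $octets);
}

print "utf-8:      ", hexbytes($utf8),  "\n";   # utf-8:      ce 98
print "iso-8859-7: ", hexbytes($greek), "\n";   # iso-8859-7: c8
```

The ce 98 pair is what the unconverted $value came out as, and the single c8 byte is the converted Theta.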
I hope this was of some help.