Posted on 2005-09-26 14:26:35-07 by declan in response to 1036
Re: Parsing Greek Letters in Excel file
Hi,

I too have been having big problems with unicode-encoded stuff in Spreadsheet::ParseExcel (Perl v5.8.4, module v2603).

Happily, I've been having some success with running a modified copy of the FmtDefault class as outlined in bugs 7376 and 168. In summary, my FmtDefault looks like this:

    #------------------------------------------------------------------------------
    # TextFmt (for Spreadsheet::ParseExcel::FmtDefault)
    #------------------------------------------------------------------------------
    #sub TextFmt($$;$) {
    #    my($oThis, $sTxt, $sCode) =@_;
    #    return $sTxt if((! defined($sCode)) || ($sCode eq '_native_'));
    #    return pack('C*', unpack('n*', $sTxt));
    #}
    sub TextFmt($$;$) {
        my($oThis, $sTxt, $sCode) =@_;
        return $sTxt if((! defined($sCode)) || ($sCode eq '_native_'));
        # Handle utf8 strings in newer perls.
        if ($] >= 5.008) {
            require Encode;
            return Encode::decode("UTF-16BE", $sTxt);
        }
        return pack('U*', unpack('n*', $sTxt));
        #return pack('C*', unpack('n*', $sTxt));
    }

This section occurs somewhere around line 68 of the file.
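
To see why the change matters, here is a small standalone sketch (not part of the module; the sample bytes and the helper name code_points are made up for illustration). It compares keeping only the low byte of each 16-bit unit, which is in effect what the old pack('C*', ...) line did, with decoding the same bytes as UTF-16BE:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: render a string as a list of code points.
sub code_points {
    my ($s) = @_;
    return join ' ', map { sprintf('U+%04X', ord($_)) } split(//, $s);
}

# Capital Theta (U+0398) followed by "A" (U+0041), as UTF-16BE bytes.
my $raw = "\x03\x98\x00\x41";

# Old behaviour, made explicit: keep only the low byte of each 16-bit unit.
my $truncated = pack('C*', map { $_ & 0xFF } unpack('n*', $raw));

# New behaviour: decode the bytes into a Perl character string.
my $decoded = decode('UTF-16BE', $raw);

print "truncated: ", code_points($truncated), "\n";   # U+0098 U+0041
print "decoded:   ", code_points($decoded),   "\n";   # U+0398 U+0041
```

The truncated version loses the high byte of the Theta, so anything outside Latin-1 comes out as the wrong character.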

This change makes it so that accessing

$cell->Value

will return one of: (a) an ISO-8859-1 string (because that's what the cell contained and no upgrade or conversion was necessary), (b) a UTF-8 string, or (c) an ISO-8859-1 string preceded by a null byte. The third case is apparently a bug that arises when the source text contains a mix of fonts. I'll raise a bug report on it later; you can probably ignore that possibility.

For most applications, and a sufficiently modern perl, you don't have to worry about the difference between UTF-8 strings and ISO-8859-1 strings: plain Latin-1 strings should be upgraded where needed. You may, however, need to tell perl what's what before concatenating mixed Latin-1/UTF-8 strings (see http://www.ahinea.com/en/tech/perl-unicode-struggle.html for more).
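
A minimal sketch of the concatenation problem, assuming one string is already a Perl character string and the other is still raw UTF-8 bytes (the variable names are only for illustration):

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $theta      = "\x{398}";                    # character string: capital Theta
my $utf8_bytes = encode('UTF-8', "\x{3a9}");   # raw bytes CE A9: capital Omega

# Wrong: perl treats each raw byte as a separate Latin-1 character.
my $mixed = $theta . $utf8_bytes;

# Right: decode the bytes into characters first, then concatenate.
my $clean = $theta . decode('UTF-8', $utf8_bytes);

print "mixed length: ", length($mixed), "\n";   # 3 (Theta plus two junk characters)
print "clean length: ", length($clean), "\n";   # 2 (Theta, Omega)
```

The same idea applies to Latin-1 input: decode it with decode('iso-8859-1', ...) before joining it to anything else.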

As for the conversion of UTF-8 to ISO-8859-7, I've never gotten Unicode::Map and FmtUnicode working correctly. Instead, I use one of two approaches. The first is 'use encoding' with the environment variable PERL_ENCODING set to the required encoding; all plain file I/O is then filtered so that data is converted to/from that encoding as it is written/read. The second is 'use Encode qw(encode decode)' followed by explicit conversion from UTF-8 to the local character set, viz:

    use Encode qw(encode decode);
    # ...
    # stuff to get values from spreadsheet
    #my $value=$sheet->{Cells}[$row][$col]->Value;
    my $value="\x{398}"; # or just use a fixed value for testing, in this case capital Theta
    # convert Perl's internal string format into ISO-8859-7 (Latin/Greek)
    my $converted=encode("iso-8859-7",$value);
    print "<$value> = ";
    print "<$converted>\n"; # don't concatenate two strings with different encodings!

And from running this I get:

    $ export LC_ALL=el_GR.iso88597
    $ perl test.pl > test.txt
    $ od -ha test.txt
    0000000 ce3c 3e98 3d20 3c20 3ec8 000a
              <   N can   >  sp   =  sp   <   H   >  nl nul
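
If the od dump is hard to read, the same bytes can be checked from Perl. The sketch below is only a cross-check of the output above, using the same fixed test value. Note that od -h shows 16-bit words, which appear byte-swapped on a little-endian machine, so "ce3c 3e98" is really the byte sequence 3c ce 98 3e, i.e. "<", the two UTF-8 bytes of Theta, then ">".

```perl
use strict;
use warnings;
use Encode qw(encode);

my $value = "\x{398}";                        # capital Theta

my $greek = encode('iso-8859-7', $value);     # single byte in the Greek code page
my $utf8  = encode('UTF-8',      $value);     # two bytes in UTF-8

print "iso-8859-7: ", unpack('H*', $greek), "\n";   # c8
print "UTF-8:      ", unpack('H*', $utf8),  "\n";   # ce98
```

So the first <...> pair in the dump holds the unconverted value written out as UTF-8 (ce 98), and the second holds the converted ISO-8859-7 byte (c8).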

I hope this was of some help.
