Posted on 2005-09-26 11:52:23-07 by partha
Parsing Greek Letters in Excel file
Hi,

I have an Excel file which contains some greek characters. I want to parse it and create a text file with its contents. I used a test program suplied in the sample files for Spreadsheet::ParseExcel package to parse Unicode characters. The resukting text file does not show the characters in Greek but some illegible characters.

I am supplying here the code I used.
------------------------
# This script requieres Unicode::Map use strict; if($#ARGV < 1) { print<<EOF; Usage: $0 Excel_File [Code] Code: CP932, CP1251, ... (same as Unicode::Map) EOF exit; } use Spreadsheet::ParseExcel; use Spreadsheet::ParseExcel::FmtUnicode; use Spreadsheet::ParseExcel::FmtDefault; my $oExcel = new Spreadsheet::ParseExcel; my $oFmtJ = Spreadsheet::ParseExcel::FmtUnicode->new(Unicode_Map => $ARGV[1]); my $oBook = $oExcel->Parse($ARGV[0], $oFmtJ); my($iR, $iC, $oWkS, $oWkC); print "=========================================\n"; print "FILE :", $oBook->{File} , "\n"; print "COUNT :", $oBook->{SheetCount} , "\n"; print "AUTHOR:", $oBook->{Author} , "\n"; my $table = []; for(my $iSheet=0; $iSheet < $oBook->{SheetCount} ; $iSheet++) { $oWkS = $oBook->{Worksheet}[$iSheet]; print "--------- SHEET:", $oWkS->{Name}, "\n"; for(my $iR = $oWkS->{MinRow} ; defined $oWkS->{MaxRow} && $iR <= $oWkS->{MaxRow} ; $iR++) { # print "ROW HEIGHT:", $oWkS->{RowHeight}[$iR], "\n"; for(my $iC = $oWkS->{MinCol} ; defined $oWkS->{MaxCol} && $iC <= $oWkS->{MaxCol} ; $iC++) { $oWkC = $oWkS->{Cells}[$iR][$iC]; print "( $iR , $iC ) =>", $oWkC->Value, "\n" if($oWkC); } } }

--------------------------
I run the following on the commandline :
perl testgr.pl GREEK > test.txt

I tried different Character codes i.e ISO-8859-7 also. but still no success.
Can anyone let me know how I can do it.

Partha
Direct Responses: 1038 | Write a response
Posted on 2005-09-26 14:26:35-07 by declan in response to 1036
Re: Parsing Greek Letters in Excel file
Hi,

I too have been having big problems with unicode-encoded stuff in Spreadsheet::ParseExcel (Perl v5.8.4, module v2603).

Happily, I've been having some success with running a modified copy of the FmtDefault class as outlined in bugs 7376 and 168. In summary, my FmtDefault looks like this:

#------------------------------------------------------------------------------ # TextFmt (for Spreadsheet::ParseExcel::FmtDefault) #------------------------------------------------------------------------------ #sub TextFmt($$;$) { # my($oThis, $sTxt, $sCode) =@_; # return $sTxt if((! defined($sCode)) || ($sCode eq '_native_')); # return pack('C*', unpack('n*', $sTxt)); #} sub TextFmt($$;$) { my($oThis, $sTxt, $sCode) =@_; return $sTxt if((! defined($sCode)) || ($sCode eq '_native_')); # Handle utf8 strings in newer perls. if ($] >= 5.008) { require Encode; return Encode::decode("UTF-16BE", $sTxt); } return pack('U*', unpack('n*', $sTxt)); #return pack('C*', unpack('n*', $sTxt)); }

This section occurs somewhere around line 68 of the file.

This change makes it so that accessing

$cell->Value

will return one of (a) an ISO-8859-1 string (because that's what the cell contained and no upgrades or conversion was necessary), (b) a UTF-8 string, or (c) an ISO-8859-1 string preceded with a null byte. This third one is apparently a bug that arises when the source text contains a mix of fonts. I'll raise an error report on it later; you can probably ignore that possibility.

For most applications, and a sufficiently modern perl, you don't have to worry about the difference between UTF-8 strings and ISO-8859-1 strings---plain Latin-1 strings should be upgraded where needed, but you may need to tell perl what's what before concatenating mixed latin1/utf strings (see http://www.ahinea.com/en/tech/perl-unicode-struggle.html for more).

As for the conversion of UTF-8 to ISO-8859-7, I've never gotten Unicode::Map and FmtUnicode working correctly. Instead, I use either 'use encoding' (and set the environment variable PERL_ENCODING to the required locale; all plain file I/O is then filtered so that it converts to/from that encoding as the data is written/read) or 'use Encode qw(encode decode)' followed by explicit conversion from utf-8 to the local character set, viz:

use Encode qw(encode decode); # ... # stuff to get values from spreadsheet #my $value=$sheet->Cells[$row][$col]; my $value="\x{398}"; # or just use a fixed value for testing, in this case capital Theta # convert Perl's internal string format into Latin-7 my $converted=encode("iso-8859-7",$value); print "<$value> = "; print "<$converted>\n"; # don't concatenate two strings with different encodings!

And from running this I get:

$ export LC_ALL=el_GR.iso88597 $ perl test.pl > test.txt $ od -ha test.txt 0000000 ce3c 3e98 3d20 3c20 3ec8 000a < N can > sp = sp < H > nl nul

I hope this was of some help.

Direct Responses: 1048 | 1965 | 11970 | Write a response
Posted on 2005-09-27 10:03:16-07 by partha in response to 1038
Re: Parsing Greek Letters in Excel file
Hi,

Your solution works great. I am only interested in the UTF8 format as I will be creating XML docs using this. The only problem I have is that it gives warnings, which I can suppress with -X oprtion to Perl. The warning I get is

Wide character in print at x3.pl line 36.

The warning is given when it tried to print UTF8/multibyte characters. Is there a way other than suppressing warnings to prevent the above warnings to appear.
Direct Responses: 1049 | Write a response
Posted on 2005-09-27 10:38:31-07 by declan in response to 1048
Re: Parsing Greek Letters in Excel file
Hi again,

I'm glad to hear that the fix worked for you too!

The wide character warning is covered in the link that I mentioned in my last message. The URL is

http://www.ahinea.com/en/tech/perl-unicode-struggle.html

The relevant bit is the second-last section of that document. Specifically, add this to your code to tell perl that STDOUT accepts UTF-8:

binmode STDOUT, ":utf8";
Direct Responses: 1057 | Write a response
Posted on 2005-09-27 16:56:10-07 by partha in response to 1049
Re: Parsing Greek Letters in Excel file
Yes it worked. I read your article about the Unicode struggle. I am impressed.
Direct Responses: Write a response
Posted on 2006-03-18 09:51:44-08 by jmac9601 in response to 1038
Re: Parsing Greek Letters in Excel file
I am experiencing a similar problem when attempting to parse an excel file.
I have applied the Spreadsheet::ParseExcel::FmtDefault patch. While that patch
does eliminate the 'C' to 'U' error, I am now getting the following errors
(my apologies for the voluminous text):

Uncaught exception from user code:
UTF-16BE:Malformed LO surrogate d899 at /usr/lib/perl5/5.8.0/i386-linux-thread-multi/Encode.pm line 154.
Encode::decode('UTF-16BE','X\x{e7}\x{9e}X9\x{d8}y\x{c4}X \x{d8}\x{99}\x{91}L\x{8b}1-r\x{bb}\x{ee}\x{a4}\x{c1}\x{e1}\x{1d}(\x{e4}v ccfu\x{6}:c\x{1a}\x{9d}\x{b3}\x{ea}\x{81}\x{18}x\x{eb}\x{c1}\x{ed}5\x{d3}\x{91}\x{c6}3\x{e4}\x{99}\x{c6}O\x{9a}8:X4e\x{b6}...') called at /usr/lib/perl5/site_perl/5.8.0/Spreadsheet/ParseExcel/FmtDefault.pm line 77

Spreadsheet::ParseExcel::FmtDefault::TextFmt('Spreadsheet::ParseExcel::FmtDefault=HASH(0x86c2fd4)','X\x{e7}\x{9e}X9\x{d8}y\x{c4}X \x{d8}\x{99}\x{91}L\x{8b}1-r\x{bb}\x{ee}\x{a4}\x{c1}\x{e1}\x{1d}(\x{e4}v ccfu\x{6}:c\x{1a}\x{9d}\x{b3}\x{ea}\x{81}\x{18}x\x{eb}\x{c1}\x{ed}5\x{d3}\x{91}\x{c6}3\x{e4}\x{99}\x{c6}O\x{9a}8:X4e\x{b6}...','ucs2') called at /usr/lib/perl5/site_perl/5.8.0/Spreadsheet/ParseExcel.pm line 1094

Spreadsheet::ParseExcel::_convBIFF8String('Spreadsheet::ParseExcel::Workbook=HASH(0x86c2f2c)',' \x{d2}Y4\x{e5}\x{e7}XX\x{9e}\x{d8}9\x{c4}y X\x{99}\x{d8}L\x{91}1\x{8b}r-\x{ee}\x{bb}\x{c1}\x{a4}\x{1d}\x{e1}\x{e4}( vccuf:\x{6}\x{1a}c\x{b3}\x{9d}\x{81}\x{ea}x\x{18}\x{c1}\x{eb}5\x{ed}\x{91}\x{d3}3\x{c6}\x{99}\x{e4}O\x{c6}8\x{9a}...') called at /usr/lib/perl5/site_perl/5.8.0/Spreadsheet/ParseExcel.pm line 1052

Spreadsheet::ParseExcel::_subWriteAccess('Spreadsheet::ParseExcel::Workbook=HASH(0x86c2f2c)',92,112,' \x{d2}Y4\x{e5}\x{e7}XX\x{9e}\x{d8}9\x{c4}y X\x{99}\x{d8}L\x{91}1\x{8b}r-\x{ee}\x{bb}\x{c1}\x{a4}\x{1d}\x{e1}\x{e4}( vccuf:\x{6}\x{1a}c\x{b3}\x{9d}\x{81}\x{ea}x\x{18}\x{c1}\x{eb}5\x{ed}\x{91}\x{d3}3\x{c6}\x{99}\x{e4}O\x{c6}8\x{9a}...') called at /usr/lib/perl5/site_perl/5.8.0/Spreadsheet/ParseExcel.pm line 428

Any assistance would be greatly appreciated.
Direct Responses: Write a response
Posted on 2009-12-17 13:42:20-08 by peterantok in response to 1038
Re: Parsing Greek Letters in Excel file
Hi! I used your solution to fix a problem with Hungarian characters o" and u". It worked perfectly. Thanks!
Direct Responses: Write a response
Perl Weekly newsletter
A free weekly newsletter for people who are busy to read all the blogs. click here to check it out.