Hiya,
I have a simple script that reads in an XML file (in utf8) using XML::DOM::Parser and spits it out again as follows:
use strict;
use warnings;
use XML::DOM;

my $inFile = $ARGV[0];
-f $inFile or die "$0: the input file $inFile could not be opened.\n";
my $outFile = $ARGV[1];
# Three-argument open with a lexical filehandle; $! reports the OS error.
open(my $out, '>', $outFile) or die "$0: the output file $outFile could not be created: $!\n";
binmode $out, ":utf8";
my $parser = XML::DOM::Parser->new();
my $inDoc = $parser->parsefile($inFile);
print {$out} $inDoc->toString;
close($out) or die "$0: could not close $outFile: $!\n";
$inDoc->dispose;
The input file looks like:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="highlight.xsl"?>
<PAPER>
<METADATA><FILENO>W06-3117</FILENO>
<APPEARED>
<CONFERENCE>Workshop</CONFERENCE>
<YEAR>2006</YEAR>
</APPEARED>
</METADATA>
<BODY>
...blah...
</BODY>
</PAPER>
My problem is that the input file displays correctly but the output does not: some characters seem to have been merged and are displayed differently. For example, `aacute' followed by `n' becomes a box glyph containing 4 characters (in Firefox) and \u386e (in emacs). Can anyone explain what's happening here? Is there something I'm neglecting to do when I create my parser, or when I write the output, that would prevent this? As far as I know, my files are all correctly encoded in utf8.
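One check that might narrow this down is to look at what the string holds just before it is printed. The following is only a minimal diagnostic sketch using core Perl and the Encode module: `describe_string` is a hypothetical helper name (it is not part of XML::DOM), and the two sample strings are made up to stand in for the `aacute' plus `n' case.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper (not part of XML::DOM): list the code points Perl
# sees in a scalar, plus the state of its internal UTF-8 flag.
sub describe_string {
    my ($str) = @_;
    my $flag  = utf8::is_utf8($str) ? 'on' : 'off';
    my $codes = join ' ', map { sprintf 'U+%04X', ord($_) } split //, $str;
    return "utf8 flag $flag: $codes";
}

# Made-up sample: a-acute followed by "n" as raw UTF-8 bytes (undecoded).
my $bytes = "\xC3\xA1n";
# The same text decoded into characters.
my $chars = decode('UTF-8', $bytes);

print describe_string($bytes), "\n";   # lists U+00C3 U+00A1 U+006E
print describe_string($chars), "\n";   # lists U+00E1 U+006E
```

Calling the helper on a short piece of `$inDoc->toString` should show which case applies: U+00E1 means the string holds characters, while the pair U+00C3 U+00A1 means it holds UTF-8 bytes that have not been decoded, in which case a `:utf8` layer on the output handle would encode them a second time. I have not confirmed that this is the cause here; it is only a way to rule that out.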
Thanks,
Anna