Posted on 2007-02-09 11:03:58-08 by anna
XML::DOM::Parser and character merging
Hiya, I have a simple script that reads in an XML file (in utf8) using XML::DOM::Parser and spits it out again as follows:
my $inFile = $ARGV[0];
-f $inFile
    or die "$0: the input file $inFile could not be opened.\n";
my $writeOutFile = '>' . $ARGV[1];
open(OUT, $writeOutFile)
    or die "$0: the output file $writeOutFile could not be created.\n";
binmode OUT, ":utf8";
my $parser = XML::DOM::Parser->new();
my $inDoc = $parser->parsefile($inFile);
print OUT $inDoc->toString;
$inDoc->dispose;
The input file looks like:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="highlight.xsl"?>
<PAPER>
  <METADATA>
    <FILENO>W06-3117</FILENO>
    <APPEARED>
      <CONFERENCE>Workshop</CONFERENCE>
      <YEAR>2006</YEAR>
    </APPEARED>
  </METADATA>
  <BODY>
    ...blah...
  </BODY>
</PAPER>
My problem is that, while the input file is displayed correctly, the output is not: some of the characters seem to have been merged and are displayed differently. For example, `aacute' followed by `n' becomes a box glyph containing 4 characters (in Firefox) and \u386e (in Emacs).

Can anyone explain what's happening here? Is there something I'm neglecting to do when I create my parser, perhaps, to prevent this from happening? As far as I know, my files are all correctly encoded in utf8.

Thanks,
Anna