Environment: perl v5.10.0 built for x86_64-linux-thread-multi, libxml2.so.2.6.32 on a SuSE 11.something, XML::LibXML (PAJAS/XML-LibXML-1.69.tar.gz) installed using CPAN
Hi all,
I'm using the XML::LibXML module to parse XHTML files that are output by another program. These files are UTF-8-encoded, and unfortunately the program generates them with a "BOM" (actually a ZWNBSP, U+FEFF) at the beginning. Although this shouldn't be a problem in itself, it appears to bother XML::LibXML (or more likely the underlying libxml2): the presence of this ZWNBSP leads to a rather abstruse error message and a rather sudden death:
file.html:6: HTML parser error : htmlParseStartTag: misplaced <body> tag
The simplest program I could come up with to reproduce the problem looks like this:
#!/usr/bin/perl -w
use strict;
use XML::LibXML;
my $file = shift;
my $xmlparser;
my $doc;
$xmlparser = XML::LibXML->new();
#$doc = $xmlparser->parse_file( $file );
$doc = $xmlparser->parse_html_file( $file );
print "done\n";
As input I used the simplest XHTML file I could produce that HTML Tidy would accept with no errors or warnings:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "xhtml1-transitional.dtd">
<html>
<head>
<title>title</title>
</head>
<body>
</body>
</html>
with the hexdump:
00000000 ef bb bf 3c 21 44 4f 43 54 59 50 45 20 68 74 6d |...<!DOCTYPE htm|
00000010 6c 20 50 55 42 4c 49 43 20 22 2d 2f 2f 57 33 43 |l PUBLIC "-//W3C|
00000020 2f 2f 44 54 44 20 58 48 54 4d 4c 20 31 2e 30 20 |//DTD XHTML 1.0 |
00000030 54 72 61 6e 73 69 74 69 6f 6e 61 6c 2f 2f 45 4e |Transitional//EN|
00000040 22 20 22 78 68 74 6d 6c 31 2d 74 72 61 6e 73 69 |" "xhtml1-transi|
00000050 74 69 6f 6e 61 6c 2e 64 74 64 22 3e 0d 0a 3c 68 |tional.dtd">..<h|
00000060 74 6d 6c 3e 0d 0a 3c 68 65 61 64 3e 0d 0a 3c 74 |tml>..<head>..<t|
00000070 69 74 6c 65 3e 74 69 74 6c 65 3c 2f 74 69 74 6c |itle>title</titl|
00000080 65 3e 0d 0a 3c 2f 68 65 61 64 3e 0d 0a 3c 62 6f |e>..</head>..<bo|
00000090 64 79 3e 0d 0a 3c 2f 62 6f 64 79 3e 0d 0a 3c 2f |dy>..</body>..</|
000000a0 68 74 6d 6c 3e 0d 0a |html>..|
000000a7
(sorry, looks like the text inside a <code> tag isn't fixed-width :-| )
The ZWNBSP is clearly visible as the first three bytes of the file (ef bb bf). After removing these three bytes, the file can be parsed without any problem.
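As a stopgap, those three bytes could be stripped in Perl before the data reaches the parser. This is only a minimal sketch: strip_utf8_bom and slurp_without_bom are names I made up for illustration, they are not part of XML::LibXML.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of XML::LibXML): remove a leading
# UTF-8 byte order mark (bytes EF BB BF) from a byte string.
sub strip_utf8_bom {
    my ($bytes) = @_;
    $bytes =~ s/\A\xEF\xBB\xBF//;
    return $bytes;
}

# Hypothetical helper: read a whole file as raw bytes and drop the
# BOM if one is present.
sub slurp_without_bom {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    local $/;                              # slurp mode
    my $bytes = <$fh>;
    close $fh;
    $bytes = '' unless defined $bytes;
    return strip_utf8_bom($bytes);
}
```

The cleaned bytes could then presumably be handed to $xmlparser->parse_html_string( slurp_without_bom($file) ) instead of calling parse_html_file. I have only confirmed that the file parses once the bytes are removed on disk, so treat the in-memory variant as untested.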
Interestingly, replacing the call to parse_html_file with parse_file in the Perl script suppresses the error, and the call displays "done" as it should. With parse_html_file, the error message mentioned above is displayed and the program dies miserably.
Now, I realize that libxml2 is widely used, and that plenty of programmers and programs would probably have stumbled upon this "feature" sooner or later. But could this actually be a bug? Or am I doing something wrong?
I haven't had a chance to test libxml2 from another programming language to check whether the problem lies in the Perl interface or in the library itself. Could anyone do that, or point me to a place where I could ask about it? Again, that's assuming it really is a bug and not my lack of knowledge...
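In case it helps anyone reproduce this, the test input can be regenerated byte for byte with something like the following. It is a rough sketch: build_test_bytes is a made-up name, and the output path is whatever you pass on the command line. The content and the CRLF line endings are taken from the hexdump above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Rebuild the test document: UTF-8 BOM (EF BB BF) followed by the
# XHTML lines, each terminated with CRLF as in the hexdump.
sub build_test_bytes {
    my @lines = (
        '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "xhtml1-transitional.dtd">',
        '<html>',
        '<head>',
        '<title>title</title>',
        '</head>',
        '<body>',
        '</body>',
        '</html>',
    );
    return "\xEF\xBB\xBF" . join( '', map { "$_\r\n" } @lines );
}

# Write the bytes only when a target path is given on the command line.
if ( my $path = shift @ARGV ) {
    open my $fh, '>:raw', $path or die "Cannot write $path: $!";
    print {$fh} build_test_bytes();
    close $fh or die "Cannot close $path: $!";
}
```

The resulting file should be 0xa7 (167) bytes long, matching the last offset in the hexdump, and can be fed to the reproduction script above.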
Thanks for your feedback!
Pagod