Posted on 2005-05-17 03:03:38-07 by jdcook
How do I know which end tag goes with which start tag?
I am attempting to write a program that parses an HTML file and then does some text substitution. I am running into a problem, however, perhaps because I don't understand how to use the HTML::TokeParser::Simple package. Here's what I am trying to do: loop through a file looking for <span> or <div> tags, then check each one for a certain attribute (editable='true'). If the attribute is there, I want to grab all the text (including any additional tags) between that tag and the closing tag that belongs to it. Some sample XHTML to illustrate:
    <div class="content">
    <span editable="true" id="nvumaincontent">stuff</span>
    <span editable="true" optional="true" id="nvutest5"><span style="background: red;">more stuff</span>
    <p>test</p></span><span editable="true" id="nvu56">even more stuff
    <!-- this is the beginning of a comment -->
    </span>
    <div editable="true" optional="true" repeatable="true" movable="true" id="nvutest43">incredible boat loads of stuff
    <!-- this is another comment -->
    </div>
    <div editable="true" id="anotherblock4">an unbelievable quantity of stuff!
    <!-- yet another comment -->
    <div id="newtest">Yo, dude!</div>
    </div>
    <!-- end main content -->
    </div>

I am mostly getting the results I want, with one exception. Where the sample above has more than one </span> or </div> closing the same block (that is, where an editable block contains a nested tag of the same kind), I am only able to get the first of those closing tags. Is there anything I can do to tell whether a given </span> or </div> actually goes with the relevant opening tag? Here is my code:
    use File::Find;
    use strict;
    use HTML::TokeParser::Simple;

    #my $new_folder = 'new_html/';
    my @html_docs = "test5.html";
    our $spancontents="";
    my @files;
    my $ByteCount=0;
    my $filelist="";
    my $isflagon=0;
    my $idflag;
    my %spancontents;
    my $templatelocation;
    my $currentdoc;

    foreach my $doc ( @html_docs ) {
        $currentdoc=$doc;
        my $p = HTML::TokeParser::Simple->new( file => $doc );
        while ( my $token = $p->get_token ) {
            if ($token->is_start_tag('span') or $token->is_start_tag('div')) {
                if ($token->get_attr('editable')=~/true/) {
                    $isflagon=1;
                    $idflag=$token->get_attr('id');
                }
            }
            if ( ($token->is_start_tag('span') and $isflagon) .. $token->is_end_tag('span') and $isflagon){
                my $text=$token->as_is;
                $spancontents.=$text.",";
                #next;
            }
            if ( ($token->is_start_tag('div') and $isflagon) .. $token->is_end_tag('div')){
                my $text=$token->as_is;
                $spancontents.=$text.",";
                #next; #not sure if needed, seems to mess things up
            }
            if (($token->is_end_tag('span') or $token->is_end_tag('div')) and $isflagon) {
                $isflagon=0;
                #$spancontents.=$token->as_is.","; #not sure if needed, seems to mess things up
                $spancontents{"$idflag"}.=$spancontents;
                $spancontents="";
            }
            if ($token->is_start_tag('html')) {
                my $attrs=$token->get_attr('templateref');
                $templatelocation=$attrs;
            }
        }
    }

    print "\n\n\n";
    foreach my $value (keys %spancontents) {
        print "value is $value\n";
        print "\nMy $value = $spancontents{$value} \n\n-------------------------\n";
    }

Here is some sample output using similar HTML as above:
    value is anotherblock4

    My anotherblock4 = <div editable="true" id="anotherblock4">,an, unbelievable quantity of stuff! ,<!-- yet another comment -->, ,<div id="newtest">,Yo, dude!,</div>,

    -------------------------
    value is nvutest43

    My nvutest43 = <div editable="true" optional="true" repeatable="true" movable="true" id="nvutest43">,incredible boat loads of stuff ,<!-- this is another comment -->, ,</div>,

    -------------------------
    value is nvutest5

    My nvutest5 = <span editable="true" optional="true" id="nvutest5">,<span style="background: red;">,more stuff,</span>,

    -------------------------
    value is nvumaincontent

    My nvumaincontent = <span editable="true" id="nvumaincontent">,stuff,</span>,

    -------------------------
    value is nvu56

    My nvu56 = <span editable="true" id="nvu56">,even more stuff ,<!-- this is the beginning of a comment -->, ,</span>,

    -------------------------

Notice that there is only one closing div or span tag under sections nvutest5 and anotherblock4. There should be two of them (i.e. two </div> tags or two </span> tags). My bottom-line question is this: how can I tell which opening tag the closing tag I am retrieving using get_end_tag belongs to?

Thanks for any help you can give, and thanks for making this module available.

Joshua Cook
Posted on 2005-05-17 03:33:06-07 by ovid in response to 464
Re: How do I know which end tag goes with which start tag?

Hi Joshua

The problem with HTML is that it is inherently free-form, and stumbling across misnested tags can throw the best algorithms for a loop (no bad pun intended). Assuming your tags are properly nested, though, the best way of dealing with this is either to switch to HTML::TreeBuilder (which would let you treat the spans as leaves on a tree) or to maintain a tag stack or a tag count. I've chosen the latter in the following snippet:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use HTML::TokeParser::Simple 3.13;

    my $parser = HTML::TokeParser::Simple->new(handle => \*DATA);

    while (my $token = $parser->get_token) {
        next unless $token->is_start_tag('span');
        my $html = get_element($parser, 'span');
        print $html;
    }

    # pass this the parser and the name of the tag you're interested in.
    sub get_element {
        my ($parser, $tag) = @_;
        my $html = '';
        my $more_tags = 0;
        while (my $token = $parser->get_token) {
            return $html if $token->is_end_tag($tag) && ! $more_tags;
            $more_tags++ if $token->is_start_tag($tag);
            $more_tags-- if $token->is_end_tag($tag);
            $html .= $token->as_is;
        }
        return $html;
    }

    __DATA__
    <head>
    <body>
    <span>
    <span foo="bar">
    stuff
    </span>
    </span>
    </body>
    </head>
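Applied to the original question, the same tag-count idea might look something like the sketch below. It is only an illustration under a few assumptions: the markup is properly nested, every editable block has an id, and editable blocks are not nested inside one another (an inner one would be swallowed into the outer block's text). The names editable_blocks and collect_element are made up for this example and are not part of HTML::TokeParser::Simple.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser::Simple;

# Hypothetical helper: read tokens up to and including the end tag that
# closes the element whose start tag was just consumed. $depth counts how
# many nested elements of the same name are currently open.
sub collect_element {
    my ($parser, $tag) = @_;
    my $html  = '';
    my $depth = 0;
    while (my $token = $parser->get_token) {
        $html .= $token->as_is;
        if ($token->is_start_tag($tag)) {
            $depth++;
        }
        elsif ($token->is_end_tag($tag)) {
            return $html if $depth == 0;   # this end tag matches our start tag
            $depth--;                      # this end tag closed a nested element
        }
    }
    return $html;   # input ran out before the matching end tag
}

# Hypothetical helper: return a hash ref mapping id => full HTML of each
# <span> or <div> that carries editable="true".
sub editable_blocks {
    my ($html) = @_;
    my $parser = HTML::TokeParser::Simple->new(string => $html);
    my %blocks;
    while (my $token = $parser->get_token) {
        # Which of the tags we care about (if any) does this token open?
        my ($tag) = grep { $token->is_start_tag($_) } qw(span div);
        next unless defined $tag;

        my $editable = $token->get_attr('editable');
        next unless defined $editable && $editable eq 'true';

        my $id = $token->get_attr('id');
        next unless defined $id;

        $blocks{$id} = $token->as_is . collect_element($parser, $tag);
    }
    return \%blocks;
}

my $sample = <<'HTML';
<div class="content">
<span editable="true" id="nvumaincontent">stuff</span>
<div editable="true" id="anotherblock4">an unbelievable quantity of stuff!
<div id="newtest">Yo, dude!</div>
</div>
</div>
HTML

my $blocks = editable_blocks($sample);
for my $id (sort keys %$blocks) {
    print "$id = $blocks->{$id}\n-------------------------\n";
}
```

Because the counter is kept per element name, a nested <div> inside an editable <div> only bumps the count, and the block ends at the </div> that brings the count back to zero. Unlike get_element above, this version keeps the outer start and end tags in the captured text.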