| Posted on 2005-10-24 22:34:56-07 by joelfinkle |
| RTF Parser Enhancements |
|
I have an interest in enhancing the RTF parser to provide better parsing to HTML, and possibly some XML schemas.
So far, I've come across a few bug-like behaviors, a couple of which I've already corrected in my own lib:
1) The HTML generated by RTF::HTML::Converter is not XHTML-friendly: much of the markup is uppercase, attribute values aren't double-quoted, and <br> should be <br/>. I've corrected everything except lowercasing the attribute values.
2) RTF::Control does not handle \pard tokens properly: \pard must clear out the current paragraph style (the same effect as setting it to Normal, or pressing Ctrl-Shift-N in MS Word). As-is, RTF::HTML::Converter carries over the style from the previous text at the same level (i.e. text not enclosed in a lower-level group). This was an easy fix.
3) I'm not sure where this is happening yet, so it isn't fixed, but parsing of attributes needs to ignore settings that are the same as the current style. In other words, there's no point emitting <h1><b>text</b></h1> when there's a "Heading 1" style, because Heading 1, let alone H1, is already bold. RTF caters to the least common denominator, so you can get all the formatting you need from the stream even if you don't pay attention to the style settings, but this means there's duplicate info that must be ignored.
4) Word 2002+ emits table row definitions twice in the first row, and RTF::HTML::Converter obliges by putting out a blank row. Definitions should accumulate until an entire row is emitted. (Not Yet Fixed)
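To make items 2 and 3 concrete, here is a rough sketch of the behavior I'm after. It is written in Python purely for illustration and is not the RTF::Control or RTF::HTML::Converter code; ParagraphState, STYLE_IMPLIES and STYLE_TAG are hypothetical names, and the style name passed with \s is assumed to have been looked up in the stylesheet already. Note that \pard only resets paragraph properties, so bold set with \b stays on until \b0 (or \plain, which this sketch doesn't handle).

```python
from html import escape

# Hypothetical tables, not taken from any RTF module.
# Character formatting that each paragraph style already implies.
STYLE_IMPLIES = {"Normal": set(), "Heading 1": {"b"}}
# Block element assumed for each paragraph style.
STYLE_TAG = {"Normal": "p", "Heading 1": "h1"}


class ParagraphState:
    """Tracks the current paragraph style and character formatting."""

    def __init__(self):
        self.style = "Normal"
        self.char_format = set()

    def control(self, word, arg=None):
        if word == "pard":
            # Item 2: \pard resets the paragraph style to Normal.
            self.style = "Normal"
        elif word == "s":
            # Assumption: arg is the style name, already resolved.
            self.style = arg
        elif word == "b":
            # \b turns bold on, \b0 turns it off.
            if arg == 0:
                self.char_format.discard("b")
            else:
                self.char_format.add("b")

    def render(self, text):
        tag = STYLE_TAG.get(self.style, "p")
        # Item 3: drop formatting the paragraph style already implies.
        extra = sorted(self.char_format - STYLE_IMPLIES.get(self.style, set()))
        inner = escape(text)
        for fmt in reversed(extra):
            inner = f"<{fmt}>{inner}</{fmt}>"
        return f"<{tag}>{inner}</{tag}>"


state = ParagraphState()
state.control("s", "Heading 1")
state.control("b")
print(state.render("Title"))  # bold is implied by Heading 1, so no <b>
state.control("pard")
print(state.render("Body"))   # style is Normal again; bold is still on
```

The set difference in render() is the whole trick for item 3: whatever the style implies never reaches the output as separate tags.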
Goals:
* Implementation of the outline styles (Heading 1, Heading 2, etc.) as begin/end <div> tokens (tricky)
* Comprehension of character styles (perhaps as <span>s)
* More thorough table handling
* Image handling is probably an impossibility, unless I stumble into someone who wants to eat encoded Windows Metafiles and spit out JPGs
* Hyperlinks in the HTML converter
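For the outline-style goal, one possible approach (only a sketch, again in Python for illustration, not anything that exists in RTF::HTML::Converter) is to keep a stack of open heading levels: a new "Heading n" first closes every open <div> at level n or deeper, then opens its own. The function name nest_outline, the (level, text) input format and the class="levelN" attribute are all my own assumptions.

```python
from html import escape


def nest_outline(blocks):
    """Wrap headings and their content in nested <div> elements.

    blocks is an iterable of (level, text) pairs, where level is n for
    a "Heading n" paragraph and None for ordinary body text.
    Returns a list of XHTML fragments.
    """
    out = []
    open_levels = []  # stack of heading levels whose <div> is still open
    for level, text in blocks:
        if level is None:
            out.append(f"<p>{escape(text)}</p>")
            continue
        # Close every open section at the same or a deeper level.
        while open_levels and open_levels[-1] >= level:
            open_levels.pop()
            out.append("</div>")
        open_levels.append(level)
        h = min(level, 6)  # HTML only defines h1 through h6
        out.append(f'<div class="level{level}">')
        out.append(f"<h{h}>{escape(text)}</h{h}>")
    # Close whatever is still open at the end of the document.
    out.extend("</div>" for _ in open_levels)
    return out


doc = [(1, "A"), (None, "x"), (2, "B"), (1, "C")]
print("\n".join(nest_outline(doc)))
```

The tricky part in a real converter is that the closing tokens have to be emitted when the next heading arrives (or at end of document), not when the heading paragraph itself ends.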
If anyone else out there has other wishes for this puppy, let me know. I'm intrigued so I'll be plugging away at this for a while.
Joel Finkle
Director, Product Strategy
Image Solutions, Inc.
PS THIS TEXT EDITOR SUCKS!!!!!!! |