Hi,
I'm trying to parse a CSV-like file with the following characteristics:
-- SEMICOLON (;) separates fields instead of COMMA (,): that's easy, because I just need to adapt an example from the FAQs
-- SEMICOLON can appear inside a value provided that it is 'escaped' by a preceding QUESTION MARK (?): that's the hard part of the story
-- QUESTION MARK also escapes QUESTION MARK itself, but it does not escape any other character in a field (to me this is pretty ugly, but that's the format!)
Please note that the parser must handle situations like:
1) "ABC???;DE;" whose value is "ABC?;DE" because the first two ? are interpreted as an 'escaped' ? and the following sequence ?; is interpreted as an 'escaped' ;
2) "ABC??;" whose value is "ABC?" because the first ? escapes the second ?, which therefore does not escape the ;
3) "ABC?DE;" whose value is "ABC?DE" because ? does not 'escape' D
I found a rather clumsy solution that pre-parses each line char-by-char before handing it to RecDescent. It converts the escaped sequences into their un-escaped form and replaces the real value separators (the un-escaped ';') with an out-of-band character ('\e'); RecDescent then uses '\e' as the value separator. It works fine, but I suspect there is a better and more 'perl-wise' solution.
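For reference, the pre-parsing step described above can be sketched roughly as follows. This is only a minimal illustration of the idea, not the exact code in use: the names preparse_line and fields_of are made up for the example, it assumes that "\e" never occurs in the input data, and it treats a trailing ';' as a terminator rather than as the start of an empty last field. A plain split stands in for the RecDescent stage.

```perl
use strict;
use warnings;

# Hypothetical helper: walk the line char-by-char, resolve the two escape
# sequences ('??' and '?;') and turn every un-escaped ';' into "\e".
sub preparse_line {
    my ($line) = @_;
    my @chars = split //, $line;
    my $out   = '';
    my $i     = 0;
    while ($i < @chars) {
        my $c    = $chars[$i];
        my $next = $i + 1 < @chars ? $chars[$i + 1] : '';
        if ($c eq '?' && ($next eq '?' || $next eq ';')) {
            $out .= $next;    # '??' becomes '?', '?;' becomes ';'
            $i += 2;
        }
        elsif ($c eq ';') {
            $out .= "\e";     # un-escaped ';' is a real separator
            $i++;
        }
        else {
            $out .= $c;       # ordinary char, including a '?' before e.g. 'D'
            $i++;
        }
    }
    return $out;
}

# Hypothetical helper: stand-in for the later parsing stage.
# Note that split drops trailing empty fields, so a final ';' acts as a
# terminator here (an assumption about the format).
sub fields_of {
    my ($line) = @_;
    return split /\e/, preparse_line($line);
}

print join('|', fields_of($_)), "\n"
    for 'ABC???;DE;', 'ABC??;', 'ABC?DE;';
```

Because the escapes are resolved left to right, a run of question marks is always consumed in pairs first, which is what makes cases 1) and 2) come out differently.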
BTW: RecDescent is great, in particular because it allows context-sensitive parsing; I love it!
Thanks,
MCo.