How do I match XML, HTML, or other nasty, ugly things with a regex?
(contributed by brian d foy)
If you just want to get work done, use a module and forget about the regular expressions. The XML::Parser and HTML::Parser modules are good starts, although each namespace has other parsing modules specialized for certain tasks and different ways of doing it. Start at CPAN Search ( http://search.cpan.org ) and wonder at all the work people have done for you already! :)
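As a rough illustration of the "use a module" route, here is a minimal sketch that pulls link targets out of a snippet with HTML::Parser's version 3 event API. It assumes HTML::Parser is installed from CPAN; the sample input and the @links variable are made up for the example.

```perl
use strict;
use warnings;
use HTML::Parser ();    # from CPAN; not part of core Perl

# Hypothetical sample input, for illustration only.
my $html = '<p>See <a href="http://search.cpan.org">CPAN</a><br/>or <a name="x">nothing</a></p>';

my @links;

# api_version 3 lets us register one handler per event type.
# The argspec 'tagname, attr' asks for the lowercased tag name
# and a hash reference of the tag's attributes.
my $parser = HTML::Parser->new(
    api_version => 3,
    start_h     => [
        sub {
            my ($tagname, $attr) = @_;
            push @links, $attr->{href}
                if $tagname eq 'a' && defined $attr->{href};
        },
        'tagname, attr',
    ],
);

$parser->parse($html);
$parser->eof;    # signal end of input so any buffered text is flushed

print "$_\n" for @links;
```

The parser deals with quoting, attribute order, and empty tags, so the handler only has to say what it wants from each start tag.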
The problem with things such as XML is that they have balanced text containing multiple levels of balanced text, but sometimes it isn't balanced text, as in an empty tag (<br/>, for instance). Even then, things can occur out-of-order. Just when you think you've got a pattern that matches your input, someone throws you a curveball.
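To make the balanced-text problem concrete, here is a small sketch of how two obvious patterns go wrong. The input strings are invented for the example.

```perl
use strict;
use warnings;

# Hypothetical input: a <div> nested inside another <div>, plus an empty tag.
my $html = '<div>outer <div>inner</div> tail<br/></div>';

# A non-greedy pattern stops at the FIRST closing tag it sees,
# so it pairs the outer <div> with the inner </div>.
my ($naive) = $html =~ m{<div>(.*?)</div>}s;
print "naive:  $naive\n";

# A greedy pattern copes with that nesting, but it swallows too much
# as soon as the input holds two sibling elements.
my $siblings = '<div>one</div><div>two</div>';
my ($greedy) = $siblings =~ m{<div>(.*)</div>}s;
print "greedy: $greedy\n";
```

Neither quantifier can count nesting depth, which is why a plain pattern keeps getting caught out by inputs like these.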
If you'd like to do it the hard way, scratching and clawing your way toward a right answer but constantly being disappointed, besieged by bug reports, and weary from the inordinate amount of time you have to spend reinventing a triangular wheel, then there are several things you can try before you give up in frustration:
Solve the balanced text problem from another question in perlfaq6
Try the recursive regex features in Perl 5.10 and later. See perlre
Try defining a grammar using Perl 5.10's (?(DEFINE)...) feature.
Break the problem down into sub-problems instead of trying to use a single regex
Convince everyone not to use XML or HTML in the first place
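For the recursive-regex and grammar suggestions in the list above, here is a minimal sketch of a toy grammar built with (?(DEFINE)...) and named recursion. It is deliberately incomplete: it assumes attribute-free tags and knows nothing about comments, CDATA, entities, or processing instructions. The rule names (text, empty, element) and the is_well_formed helper are hypothetical.

```perl
use strict;
use warnings;
use 5.010;    # recursion, named captures, and (?(DEFINE)...) need 5.10+

my $well_formed = qr{
    \A \s* (?&element) \s* \z

    (?(DEFINE)
        (?<text>  [^<>]++ )             # run of non-tag characters
        (?<empty> < \w+ \s* / > )       # empty tag such as <br/>
        (?<element>
            (?&empty)
          |
            < (?<tag> \w+ ) >           # opening tag, name remembered
            (?: (?&text) | (?&element) )*
            </ \k<tag> >                # closing tag must use the same name
        )
    )
}x;

# Returns true when the whole string is one balanced element.
sub is_well_formed {
    my ($string) = @_;
    return $string =~ $well_formed ? 1 : 0;
}

for my $sample (
    '<div>outer <div>inner</div> tail<br/></div>',
    '<a><b>text</a></b>',
    '<p>unclosed',
) {
    printf "%-45s %s\n", $sample, is_well_formed($sample) ? 'ok' : 'not ok';
}
```

Even this small grammar shows why the FAQ steers you toward a parser module: every real-world feature you add (attributes, comments, optional closing tags in HTML) means another rule and another chance for a bug report.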