Help with a regular expression quandary

noen

Ok, so I am rusty as hell with regular expressions. So far I have been able to eke by with some horrid amalgamations of expressions to get the right stuff parsed out, but now I seem to be stuck.

Basically, I want to search through an HTML document for
<td class="foo">bar</td>

Where foo can be any class name and bar can be mixed HTML content (it may contain other tables, etc.). Right now this is my completely non-working expression:

"/<td.*class=\"([^\"]*)\">([^<\/td]*)/"

Now I know from previous expressions that the class=\"([^\"]*)\" part works fine, but how can I tell it to find the proper </td>, given that there can be other </td> tags in the intermediate content I want to capture?

My current guess is I am going to have to write a custom string parser to be sure that it can handle internal tables, but I was hoping there may be some magical regex to save me from this and keep my code nice and tidy.

Any help is appreciated
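
For reference, the custom parser mentioned above does not have to be big. Here is a minimal sketch in Python (my choice for illustration, the thread does not name a language) that uses a regex only to find <td> and </td> tags and keeps a stack to pair them up. The name find_cells is made up, and the sketch assumes that attribute values contain no ">" characters, that class values are double-quoted, and that no <td> tags hide inside comments or scripts.

```python
import re

# Matches any opening <td ...> or closing </td> tag, case-insensitively.
TD_TAG = re.compile(r"<(/?)td\b([^>]*)>", re.IGNORECASE)
# Pulls a double-quoted class value out of a tag's attribute text.
CLASS_ATTR = re.compile(r'class="([^"]*)"', re.IGNORECASE)


def find_cells(html):
    """Return (class_name, inner_html) for every <td class="..."> cell.

    Nested cells are reported too; inner cells appear before the cell
    that encloses them, because they are closed first.
    """
    results = []
    stack = []  # one (class_name or None, content_start) entry per open <td>
    for m in TD_TAG.finditer(html):
        if m.group(1) == "":
            # Opening tag: remember its class and where its content begins.
            cls = CLASS_ATTR.search(m.group(2))
            stack.append((cls.group(1) if cls else None, m.end()))
        elif stack:
            # Closing tag: it belongs to the most recently opened <td>.
            class_name, start = stack.pop()
            if class_name is not None:
                results.append((class_name, html[start:m.start()]))
    return results


if __name__ == "__main__":
    sample = ('<td class="outer">a<table><tr>'
              '<td class="inner">b</td></tr></table>c</td>')
    for name, content in find_cells(sample):
        print(name, "=>", content)
```

The stack is what a plain regular expression lacks: it remembers how many <td> tags are still open, so each </td> is paired with the right opening tag no matter how deep the tables go.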
 
Is it possible to add a comment after the closing </td> for which you're searching?

That way, you'd search for something within the comments that you know wouldn't be in the document elsewhere.

In Thinking in Java, Eckel uses something like this in his code scripts, to tell a Python script where to end.

For example, use something like:

<!-- :=+=: END -->

Or make up your own nice ending. :)
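
A rough sketch of how the marker idea could look in Python (the language is my assumption, the original post does not say what is being used). The names END_MARKER, CELL and extract_marked_cells are invented for the example. It assumes you control the HTML, that the marker comes right after the closing </td> of the cell you want, and that no other <td class="..."> tag sits between the start of the document (or the previous marker) and that cell.

```python
import re

# The sentinel comment placed right after the closing </td> of interest.
END_MARKER = "<!-- :=+=: END -->"

# Capture the class name and everything up to the first </td> that is
# immediately followed (ignoring whitespace) by the sentinel comment.
CELL = re.compile(
    r'<td[^>]*\bclass="([^"]*)"[^>]*>(.*?)</td>\s*' + re.escape(END_MARKER),
    re.DOTALL,
)


def extract_marked_cells(html):
    """Return a list of (class_name, inner_html) for each marked cell."""
    return CELL.findall(html)


if __name__ == "__main__":
    sample = ('<td class="foo">x<table><tr><td>y</td></tr></table>z</td>'
              + END_MARKER)
    print(extract_marked_cells(sample))
```

The inner </td> is skipped because it is not followed by the marker, so the nesting problem never comes up. The catch is the one already mentioned: it only works if you can add the comment to the document.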
 
You could always hack together something that solves this problem for any fixed, finite nesting depth, but a document nested one level deeper will still break it. That's just a limitation of regexen (unless you're dealing with REs that are significantly more powerful than the REs you deal with in CS theory courses...)
 
No, not possible. As ameoba said, it's the balancing problem. You can't check whether the tags are correctly balanced or not.

This Boost Regex is as far as I got - you can't handle your "bar" with any kind of regular expressions unless you are willing to prohibit nested tables.

regex("<td\\s*class=(\"([^<&\"]|&[^;]*;)*\"|'([^<&']|&[^;]*;)*')>")
 
Sorry, liqdfire, this cannot be done with any regular expressions, anywhere.
 
Originally posted by carl67lp
Is it possible to add a comment after the closing </td> for which you're searching?
...
For example, use something like:
<!-- :=+=: END -->
Or make up your own nice ending. :)
I'd have to agree that this seems like the best option if you can do it.
 
Well... you -could- scan the document once to figure how deep the max depth of nesting is and dynamically create a regex that handles them properly. Finding max nesting depth shouldn't be too hard and the structure of a regex that handles multiple possible levels of nesting should be fairly regular & easy to write a function for generating them.
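
Something along these lines, sketched in Python purely as an illustration (not necessarily the language the original poster uses). The function names max_td_depth and build_cell_regex are made up. It assumes well-formed <td> tags, no ">" inside attribute values, and double-quoted class values.

```python
import re

TD_TAG = re.compile(r"<(/?)td\b[^>]*>", re.IGNORECASE)


def max_td_depth(html):
    """Return the deepest level of <td> nesting found in the document."""
    depth = deepest = 0
    for m in TD_TAG.finditer(html):
        if m.group(1):                # closing </td>
            depth = max(depth - 1, 0)
        else:                         # opening <td ...>
            depth += 1
            deepest = max(deepest, depth)
    return deepest


def build_cell_regex(levels):
    """Build a regex for <td class="..."> cells whose content may contain
    up to `levels` levels of nested <td> elements, and no more."""
    # Level 0: any text that contains no <td> or </td> tag at all.
    inner = r"(?:(?!</?td\b).)*"
    for _ in range(levels):
        # Each pass allows one more level: plain text, or a complete
        # <td>...</td> whose content matches the previous level.
        inner = r"(?:(?!</?td\b).|<td\b[^>]*>" + inner + r"</td>)*"
    return re.compile(
        r'<td\b[^>]*\bclass="([^"]*)"[^>]*>(' + inner + r")</td>",
        re.IGNORECASE | re.DOTALL,
    )


if __name__ == "__main__":
    sample = ('<td class="outer">a<table><tr>'
              '<td class="inner">b</td></tr></table>c</td>')
    # The outermost cell is depth 1, so it can hold depth - 1 nested levels.
    pattern = build_cell_regex(max(max_td_depth(sample) - 1, 0))
    m = pattern.search(sample)
    print(m.group(1), "=>", m.group(2))
```

The generated pattern roughly doubles in length with each extra level, and it is only correct for the depth it was built for. Building it with too few levels makes it skip the outer cell and match an inner one instead, which is the depth+1 problem in action.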
 