Regular expression help

lomn75 · Mar 8, 2005

Hey, I'm trying to get some regex magic to work

I've got data composed hierarchically, like this:

Code:

<tabs>["section"] = {
...
...
...
<matching tabs>}

where <tabs> and <matching tabs> are an indeterminate, but matching, number of tab characters. Because of the hierarchy, subsections can be nested (adding one more tab at each level).

I've written a function to take in the section name and return the whole section, but I can't figure out how to make it independent of the number of tabs. A non-greedy search will abort at the first subsection, and greedy will continue the length of the data.

I think the solution is akin to saying "count n tabs before ["section"] and then match n tabs at the end" but I don't know how to implement this.

I'm dealing with Perl-compatible, but can probably adapt another syntax.

HeThatKnows · Mar 8, 2005

You can do it if your regex engine supports backreferences (backmatching). For example: \t matches a tab character, \t* matches zero or more tabs, and (\t*) matches zero or more tabs and stores the matched characters in a buffer. Characters stored in a buffer are available to replacement strings via \# (where # is a digit from 1 to 9, counting buffers from the beginning of the regex).

In backmatching, the \# is also used in the regex, like

Code:

(d*)a\1

which would match dad, ddadd, dddaddd, etc.
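As a quick check of the backreference idea, here is a minimal sketch in Python's `re` module (chosen only because it is easy to run; the thread itself is about PCRE). The sample strings are illustrative.

```python
import re

# (d*) captures a run of d's; \1 must then repeat exactly the captured text.
pattern = re.compile(r"(d*)a\1")

for text in ["dad", "ddadd", "dddaddd", "ddad"]:
    # fullmatch requires the whole string to fit the pattern.
    print(text, bool(pattern.fullmatch(text)))
```

The last string is rejected because the run of d's after the "a" is shorter than the run before it.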

You'd probably need something like

Code:

\n(\t*)\["section"\] = \{(.*\n)*\1\}

The \n at the beginning starts the match with a newline, which could be a problem if the section header is the very first line of your text block, since no newline precedes it. In a regex, ^ matches the beginning of the text block; some engines also allow it to match the beginning of each line of multi-line input. If yours allows this, a ^ would be better than the \n.
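The difference between a leading \n and a multi-line ^ can be sketched like this, again in Python as a stand-in. The sample data and the section names "top" and "inner" are made up for illustration.

```python
import re

# Hypothetical sample whose very first line is already a section header.
data = '["top"] = {\n\t["inner"] = {\n\t}\n}\n'

# A leading \n needs a newline before the header, so the first line is missed.
with_newline = re.search(r'\n(\t*)\["top"\]', data)

# With re.MULTILINE, ^ matches at the start of the text and after each newline.
with_caret = re.search(r'^(\t*)\["top"\]', data, re.MULTILINE)
with_caret_inner = re.search(r'^(\t*)\["inner"\]', data, re.MULTILINE)

print(with_newline)                     # no match object
print(repr(with_caret.group(1)))        # captured indentation of "top"
print(repr(with_caret_inner.group(1)))  # captured indentation of "inner"
```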

xENo · Mar 8, 2005

*claps*

lomn75 · Mar 8, 2005

Hmmm... yeah, that should do it. Now I've got issues with the final implementation.

PHP's PCRE, here's the pattern:

Code:

/^(\t*)\[\"$sectionName\"\].+?^\1\},/sm

The "sm" modifiers do two things: "s" lets . match \n, and "m" lets ^ and $ match at line breaks within the parsed data.
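For comparison, a rough Python equivalent of this pattern might look like the sketch below. The function name `get_section`, the sample data, and the optional trailing comma are assumptions for illustration; `re.DOTALL` and `re.MULTILINE` play the roles of the "s" and "m" modifiers.

```python
import re

def get_section(data, name):
    """Return the text of the named section, or None if it is not found."""
    pattern = (
        r'^(\t*)\["' + re.escape(name) + r'"\]'  # header line; tabs captured in group 1
        + r'.+?'                                 # shortest body (DOTALL lets . cross lines)
        + r'^\1\},?'                             # closing brace at the same tab depth
    )
    m = re.search(pattern, data, re.DOTALL | re.MULTILINE)
    return m.group(0) if m else None

# Hypothetical nested data in the format described at the top of the thread.
data = (
    '["root"] = {\n'
    '\t["video"] = {\n'
    '\t\t["mode"] = {\n'
    '\t\t\twidth = 800,\n'
    '\t\t},\n'
    '\t\tdepth = 32,\n'
    '\t},\n'
    '\t["audio"] = {\n'
    '\t\tvolume = 7,\n'
    '\t},\n'
    '},\n'
)

print(get_section(data, "video"))
```

The non-greedy body no longer stops at the nested "mode" section, because that section's closing brace is preceded by two tabs and so cannot follow `^\1` when group 1 holds a single tab.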

My first version accepted a parameter and placed n \t's in the right places: (\t*) and \1. That worked fine.

(\t*) works fine as well. Done without the backreference, it starts the match at the right spot, it just continues too far (back to the root level }).

Toss in the backreference and it quits matching. I've checked the pattern matched for \1 and it's the right number of tab characters. I've tried \s as well, no better luck.

Toss in a \t* at the backreference and it matches again, though the termination is again in the wrong place.

Backreferences are supported, I've verified them in my manual and this appears to be correct syntax.

lomn75 · Mar 8, 2005

Fixed...

On a whim, I converted my " string delimiters to ' (and moved $sectionName out from the inlining) and it works.

That is,

$match = "/^(\t*)\[\"$sectionName\"\].+?^\1\},/sm";

becomes

$match = '/^(\t*)\[\"' . $sectionName . '\"\].+?^\1\},/sm';

and it's fine now.

So, thanks HTK for the backreference pointer. Anybody know why changing the string type made a difference?
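A likely explanation, sketched here in Python rather than PHP: inside a PHP double-quoted string, \1 is read as an octal escape and becomes the character with code 1 before PCRE ever sees the pattern, so the backreference is lost. The \t survives because a literal tab in the pattern still matches a tab. Single-quoted PHP strings leave both sequences alone. Python's ordinary (non-raw) strings treat "\1" the same way, so the effect is easy to reproduce; the section name "a" and the sample data are made up.

```python
import re

data = '\t["a"] = {\n\t\tx = 1,\n\t},\n'

# Ordinary string: "\1" is an octal escape, so the regex engine receives chr(1).
cooked = "^(\t*)\\[\"a\"\\].+?^\1\\},"
# Raw string: the backslash and the digit both reach the regex engine.
raw = r'^(\t*)\["a"\].+?^\1\},'

flags = re.DOTALL | re.MULTILINE
print("\x01" in cooked)                # the backreference became a control character
print(re.search(cooked, data, flags))  # nothing in the data matches chr(1)
print(re.search(raw, data, flags))     # matches the whole section
```

If double quotes are needed in PHP for variable interpolation, writing the backreference as \\1 should have the same effect as switching to single quotes.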

HeThatKnows · Mar 9, 2005

I don't know why the different syntax would change things, though I have seen a bunch of people get weird results from inline variables. Dunno.

Out o' curiosity...does dot in your Perl-compatible regex match newlines too?

lomn75 · Mar 9, 2005

HeThatKnows said:
I don't know why the different syntax would change things, though I have seen a bunch of people get weird results from inline variables. Dunno.

Out o' curiosity...does dot in your Perl-compatible regex match newlines too?

The "s" flag at the end makes '.' match newlines, and the "m" flag makes '^' and '$' match at line breaks. That's how I can throw in the '^' just before the backreference.
