Regex problem

Wiseguy2001 · Sep 25, 2006

Hi guys,
I'm trying to extract Test Page from Title="Test Page". However, it's coming out as st Page (cutting the T and the e; notice the e in Page is fine). My assumption is that these letters are being cut because the regex parser (.NET) is treating the [^Title=""] as a list of characters rather than a string. The whole expression is below:

[^Title=""].*[^""]

eloj · Sep 25, 2006

Typically the use of [] indicates a character class, so... And a ^ at the start of a character class is a "not" operation.

Maybe Title=\"(.*)\", and you might need to make it non-greedy. And case-insensitive.
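
A minimal C# sketch of that suggestion, assuming .NET's System.Text.RegularExpressions. The class name, method names and sample input are made up for illustration. It compares the greedy (.*) with the non-greedy (.*?) when a second quoted value follows on the same line, and uses RegexOptions.IgnoreCase for the case-insensitive part:

```csharp
using System;
using System.Text.RegularExpressions;

public static class TitleCapture
{
    // Greedy: (.*) runs on to the LAST double quote on the line.
    public static string Greedy(string input)
    {
        return Regex.Match(input, "Title=\"(.*)\"", RegexOptions.IgnoreCase).Groups[1].Value;
    }

    // Non-greedy: (.*?) stops at the FIRST double quote after Title=".
    public static string Lazy(string input)
    {
        return Regex.Match(input, "Title=\"(.*?)\"", RegexOptions.IgnoreCase).Groups[1].Value;
    }

    public static void Main()
    {
        // Hypothetical input with a second quoted attribute after the title.
        string s = "junk title=\"Test Page\" class=\"x\" more junk";
        Console.WriteLine(Greedy(s)); // Test Page" class="x
        Console.WriteLine(Lazy(s));   // Test Page
    }
}
```

If the input only ever contains one quoted value per line, the two versions return the same thing.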

Wiseguy2001 · Sep 25, 2006

Thanks for the attempt, but unfortunately that made it worse.

I've rearranged it but I'm still getting the leading double quote, which I can easily take out with .Substring(1). Seems a shame that I can't do it all with Regex.

Can anybody recommend any books on the subject?

eloj · Sep 25, 2006

I guess you'll have to call Microsoft then. I don't know how they fucked up the .net regexp library.

Code:

eloj@nynaeve:~/dev =) $ cat regexptest.pl
#!/usr/bin/perl
my $a = 'somejunk Title="Test "Page"" some more junk here';
print "The title is '$1'\n" if( $a =~ m/title=\"(.*)\"/i );
eloj@nynaeve:~/dev =) $ ./regexptest.pl
The title is 'Test "Page"'
eloj@nynaeve:~/dev =) $

eloj · Sep 25, 2006

To be explicit: in your first example you define a class of characters not to match. This class includes 'T' and 'e', so it's no wonder that you don't get those in your capture of "Test Page", is it? The details of why are a bit tricky (try looking at your string one character at a time, left to right, testing each against your character class), but what you want to do is in practice very simple once you have the basics.
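
The left-to-right walk can be sketched in C# like this, assuming .NET's System.Text.RegularExpressions (the class and method names are only for illustration). It prints which characters the leading [^Title=""] class rejects, then the match the whole expression produces:

```csharp
using System;
using System.Text.RegularExpressions;

public static class CharClassTrace
{
    // The expression from the first post. In a verbatim string "" is one
    // literal quote, so the regex engine sees [^Title="].*[^"]
    public const string Pattern = @"[^Title=""].*[^""]";

    public static string FirstMatch(string input)
    {
        return Regex.Match(input, Pattern).Value;
    }

    public static void Main()
    {
        string input = "Title=\"Test Page\"";

        // A match can only start on a character that is NOT one of T i t l e = "
        foreach (char c in input)
        {
            bool excluded = "Title=\"".IndexOf(c) >= 0;
            Console.WriteLine("'{0}' {1}", c, excluded ? "skipped" : "can start a match");
        }

        // The first acceptable start is the 's' of Test; the trailing [^"]
        // then forces the match to end before the closing quote.
        Console.WriteLine(FirstMatch(input)); // st Page
    }
}
```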

The correct way to do this is as I showed you, but how to specify that in ".NETty" I don't know. Shouldn't be too hard, and I expect that you still have to use parentheses to select which part(s) to capture.

Wiseguy2001 · Sep 25, 2006

Hey, thanks for the help. I don't think I need to call MS; I'm sure the bug is in front of my laptop. I tried modifying your code, but no joy. I've got it working but it is anything but elegant: expressions within expressions, then further string manipulation afterwards to polish it off.

I'm not proud of it but here it is (S is the input string).

Code:

txtTitle.Text = Regex.Match(Regex.Match(S, @"Title="".*""").ToString(), @"[^Title=][^""|.]*").ToString().Substring(1);

Anybody know a good book on the subject that isn't insanely dry?

[MS] · Sep 26, 2006

Works for me:

Code:

string str = "dobedobedo... Title=\"Test Page\" huh?";
Console.WriteLine(Regex.Match(str, "Title=\"(.*)\"").Groups[1]);

So, let's look at what you're doing:

this: Regex.Match(S, @"Title="".*""")
means: find a substring in S bounded on the left by Title=" followed by any number of characters other than newline, and ending with ".

You then call ToString on the match, which is spec'ed to return the full capture. So this would be the string Title="something".

you then call Regex.Match(priorResult, @"[^Title=][^""|.]*")
which means, well, I don't know. It means: seek to the first character that is not any of e, i, l, T, t, or =, which must be immediately followed by any number of characters that are not ", | or . (inside a character class, | and . are plain literals, not alternation or "any character"). I don't think this is what you are trying to do.

The example eloj gave is correct, and the code segment above has the C# code to do that.
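
To make the walkthrough concrete, here is a rough C# sketch that prints each intermediate result of the nested version next to the single capture-group version. The class name, method names and the second sample string are made up for illustration; the patterns are the ones quoted above:

```csharp
using System;
using System.Text.RegularExpressions;

public static class TwoStageTrace
{
    // Inner call: the whole Title="..." substring.
    public static string Stage1(string s)
    {
        return Regex.Match(s, @"Title="".*""").Value;
    }

    // Outer call: first character outside [Title=], then a run of
    // characters that are not a quote, a pipe or a dot.
    public static string Stage2(string s)
    {
        return Regex.Match(Stage1(s), @"[^Title=][^""|.]*").Value;
    }

    // One pattern, one capture group.
    public static string SingleGroup(string s)
    {
        return Regex.Match(s, "Title=\"(.*)\"").Groups[1].Value;
    }

    public static void Main()
    {
        string s = "dobedobedo... Title=\"Test Page\" huh?";
        Console.WriteLine(Stage1(s));              // Title="Test Page"
        Console.WriteLine(Stage2(s));              // "Test Page
        Console.WriteLine(Stage2(s).Substring(1)); // Test Page
        Console.WriteLine(SingleGroup(s));         // Test Page

        // Hypothetical title containing a dot: the nested version stops early.
        string dotted = "Title=\"Mr. Smith\"";
        Console.WriteLine(Stage2(dotted).Substring(1)); // Mr
        Console.WriteLine(SingleGroup(dotted));         // Mr. Smith
    }
}
```

The nested version only gives the right answer when the title contains no dot or pipe, which is one more reason to prefer the capture group.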

For reference, the expression reference on MSDN is pretty good:
http://msdn.microsoft.com/library/en-us/cpgenref/html/cpconRegularExpressionsLanguageElements.asp

Wiseguy2001 · Sep 26, 2006

Ah, much better. But did you have to make it look soo easy? lol

Thanks for the link, I overlooked the group class which makes life much easier.
