
In my C# program I wrote a Google search function that fetches the source of each result page and extracts the URLs with a regex.

My current regex is:

(?:(?:(?:http)://)(?:w{3}\\.)?(?:[a-zA-Z0-9/;\\?&=:\\-_\\$\\+!\\*'\\(\\|\\\\~\\[\\]#%\\.])+)

This works well at the moment, but I get URLs such as http://www.example.com/forums/arcade.php?efdf=332

In this case I just want the URL without the ?efdf=332 at the end.

So how should I change the regex?

  • Hi Omegavirus, welcome to Stack Overflow. I've noticed that you went to a lot of trouble to get your regex formatted correctly, but you could have had it a lot easier. Just paste the original regex, mark it and press Ctrl-K. This will format the text as verbatim text (like HTML pre tag). Much less potential for errors. Commented Nov 21, 2010 at 14:09
  • Oh, I didn't know that, thanks ;) The regex is from my C# program, so the backslashes are escaped. I forgot to say that. Commented Nov 21, 2010 at 14:10
  • In C#, use verbatim strings (@"foo") with regexes. Then you don't have to escape your backslashes. You'll go crazy otherwise. Regexes are hard enough to read already... Commented Nov 21, 2010 at 14:15

2 Answers

http://(?:www\.)?[a-zA-Z0-9/;&=:_$+!*'()|~\[\]#%.\\-]+

does the same as your regex (I've removed a lot of unnecessary cruft) but stops matching a link before a ?.

In C#:

Regex regexObj = new Regex(@"http://(?:www\.)?[a-zA-Z0-9/;&=:_$+!*'()|~\[\]#%.\\-]+");

That said, I'm not sure this is such a good way of matching URLs (what about https, ftp, mailto etc.?)
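
As a rough illustration of how the pattern above might be applied to a page's source, here is a minimal sketch. The class name `UrlExtractor`, the method `ExtractUrls`, and the sample HTML are made up for the example and are not from the question.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class UrlExtractor
{
    // Pattern from the answer above. '?' is not in the character class,
    // so each match ends just before any query string.
    private static readonly Regex UrlRegex = new Regex(
        @"http://(?:www\.)?[a-zA-Z0-9/;&=:_$+!*'()|~\[\]#%.\\-]+");

    // Hypothetical helper: returns every match found in the given page source.
    public static List<string> ExtractUrls(string pageSource)
    {
        var urls = new List<string>();
        foreach (Match m in UrlRegex.Matches(pageSource))
        {
            urls.Add(m.Value);
        }
        return urls;
    }

    public static void Main()
    {
        // Sample input, assumed for illustration only.
        string html = "<a href=\"http://www.example.com/forums/arcade.php?efdf=332\">Arcade</a>";
        foreach (string url in ExtractUrls(html))
        {
            Console.WriteLine(url); // prints http://www.example.com/forums/arcade.php
        }
    }
}
```

Because the match simply stops at the first character outside the class, anything you add to or remove from the class changes where URLs get cut off.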


3 Comments

Many thanks ;) https etc. is no problem because I won't need them; http is all I need :) I just tested your regex and it's nearly working, but now I get URLs like blabla.com/forums/&ampblabla. How can I filter these out as well?
So you just want domain + path without any parameters?
Try removing the & from the regex. Same with any other characters you don't want to allow.

You can use the Uri class to access various parts of the URL and either remove the query string from the end, or concatenate the parts you want.
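
One way this could look, as a minimal sketch: `Uri.GetLeftPart(UriPartial.Path)` returns the scheme, host, and path while leaving out the query string and fragment. The class name `QueryStripper` and the method `StripQuery` are hypothetical names chosen for the example.

```csharp
using System;

public static class QueryStripper
{
    // Hypothetical helper: returns the URL up to and including the path,
    // dropping the query string and fragment.
    public static string StripQuery(string url)
    {
        // The Uri constructor throws UriFormatException on malformed input;
        // Uri.TryCreate is an alternative if the input is untrusted.
        var uri = new Uri(url);
        return uri.GetLeftPart(UriPartial.Path);
    }

    public static void Main()
    {
        string url = "http://www.example.com/forums/arcade.php?efdf=332";
        Console.WriteLine(StripQuery(url)); // prints http://www.example.com/forums/arcade.php
    }
}
```

This can be combined with a looser regex: match the whole URL first, then let `Uri` decide where the path ends instead of encoding that rule in the character class.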
