2

Is it possible to search for and remove URLs from a string in PHP? I'm talking about the actual text here, not the HTML. Examples to remove:

mywebsite.com
http://mywebsite.org
www.mywebsite.co.uk
www.my-web-site.net
sub.mywebsite.edu
etc

My issue is users submitting a description field and using it to promote their own URLs. I'm not sure if it's possible without generating too many false positives. I've thought about detecting the http:// or www. prefix, but that doesn't stop links like mywebsite.com

2 Comments
  • See stackoverflow.com/questions/910912/…. This link may not solve your problem, but there's some information in the answers you may find useful. Commented Oct 14, 2011 at 15:05
  • The most effective way to find URLs (whether encoded as www dot place dot com or any other way) is to use the human eyes and brain - involve the community, if at all possible. Commented Oct 14, 2011 at 16:07

3 Answers

1

This regex seems to do the trick:

!\b(((ht|f)tp(s?))\://)?(www.|[a-z].)[a-z0-9\-\.]+\.(com|edu|gov|mil|net|org|biz|info|name|museum|us|ca|uk)(\:[0-9]+)*(/($|[a-z0-9\.\,\;\?\\'\\\\\+&%\$#\=~_\-]+))*\b!i

It is a slight modification of this regex from Regular Expression Library.

I realize it’s a bit overwhelming, but that's to be expected when searching for URLs. Nevertheless, it matches everything on your list.

Alternatively, you could loop through each word in the description and use parse_url() to see how the word breaks down. I’ll leave the criteria for determining whether it's a URL to you. There’s still the potential for false positives, but they could be greatly reduced. Combined with Andrew’s idea of flagging questionable content for moderation, it could be a workable solution.
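
A minimal sketch of the word-by-word parse_url() idea. The helper names (looksLikeUrl, removeUrlWords) and the criteria (a host with at least one dot and an alphabetic final label of two or more letters) are assumptions for illustration, not something the answer specifies. Because parse_url() only reports a host when a scheme is present, the sketch prepends http:// to bare words first.

```php
<?php
// Hypothetical helper: decides whether a single word looks like a URL.
// The criteria below are an assumption, not a complete rule.
function looksLikeUrl(string $word): bool
{
    // Strip surrounding punctuation such as trailing commas or periods.
    $word = trim($word, " \t\n\r.,;:!?()\"'");
    if ($word === '') {
        return false;
    }
    // parse_url() only fills in the host when a scheme is present,
    // so add one if the word does not already start with a scheme.
    if (!preg_match('~^[a-z][a-z0-9+.-]*://~i', $word)) {
        $word = 'http://' . $word;
    }
    $host = parse_url($word, PHP_URL_HOST);
    if (!is_string($host)) {
        return false; // malformed input or no host component
    }
    // Require label.label with an alphabetic final label (e.g. "co.uk").
    return preg_match('/^([a-z0-9-]+\.)+[a-z]{2,}$/i', $host) === 1;
}

// Hypothetical helper: drops every word that looks like a URL.
// Note that runs of whitespace are collapsed to single spaces.
function removeUrlWords(string $text): string
{
    $words = preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY);
    $kept = array_filter($words, function ($w) {
        return !looksLikeUrl($w);
    });
    return implode(' ', $kept);
}

echo removeUrlWords('save money, come to www.my-web-site.net today'), PHP_EOL;
```

These criteria still produce false positives for dotted words such as file names (report.txt), so the result is probably better used to flag a description than to rewrite it silently.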

7 Comments

@Code Jockey: add it to the piped list (com|edu|gov|...|ca|uk|travel)
This also doesn't filter out a lot of the URL shorteners out there (bit.ly, goo.gl, etc...)
I have yet to find the perfect regex for matching URLs. I'd be interested in seeing it if anyone has.
Such an expression would test the limits of a Cray supercomputer, but I'm sure it's technically possible - I'm just picking nits!
I can live without the URL shorteners. I'm just trying to stop the blatant piss taking. For example, we have had stuff like: "Don't buy here, save money and come direct to our store www.douchebags.com"
0

You could try something that looks for .TLD, where TLD is any existing top-level domain, but that may result in too many false positives.

Would it be possible to implement a system where posts containing questionable content need moderation to be posted, but others are posted right away? I'm assuming it's a firm business requirement to disallow this type of content.
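
A rough sketch of how the .TLD check could feed such a moderation step. The function name needsModeration and the short SUSPECT_TLDS list are assumptions for illustration; a real list of top-level domains is far longer, and a longer list means more false positives.

```php
<?php
// Assumed, deliberately short TLD list for illustration only.
const SUSPECT_TLDS = ['com', 'org', 'net', 'edu', 'uk'];

// Hypothetical helper: true when the text contains "something.TLD",
// meaning the post should wait for a moderator instead of going live.
function needsModeration(string $text): bool
{
    $tlds = implode('|', array_map('preg_quote', SUSPECT_TLDS));
    // A run of host characters, a dot, then a listed TLD ending at a word boundary.
    $pattern = '/\b[a-z0-9-]+\.(?:' . $tlds . ')\b/i';
    return preg_match($pattern, $text) === 1;
}

$post = 'Save money and come direct to our store www.mywebsite.co.uk';
echo needsModeration($post) ? 'hold for moderation' : 'publish', PHP_EOL;
```

Posts that pass the check are published right away; everything else is queued, so a false positive costs a delay rather than a lost description.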

Personally, I would tend to just prevent any hyperlinking, and leave it at that. But, it's not my app.

3 Comments

I'd do this - but expand on it a little bit so after I've found a matching TLD I'd go backwards in the string a little bit and inspect the string up until I get a non-url character (like space, newline, etc.). Though this doesn't stop people doing the things where they do "example [dot] c0m"
Hyperlinking is already prevented, but users have just moved to making text links instead. I recognise that I'm never going to be able to stop the most determined linker (the example [dot] c0m) but would like to stop the casual example.com
Another option (depending upon your primary user base and their level of activity and cooperation) is a flag/vote down button, which can either get a moderator's attention, or hide/delete the comment after so many votes (or both! - though this might take more effort to implement, obviously)
0

You can use a regex to find the URLs, then specify what to replace them with using PHP's preg_replace() function.

http://daringfireball.net/2010/07/improved_regex_for_matching_urls

Edit: Since this is user-submitted data, you might want to do some validation before you store the "description" field, and check to see if it contains a URL. If it does, you can prevent the user from saving the form.

For this, you can use preg_match, while still using a regex to find a URL.
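
A small sketch of both paths, using preg_match() for validation and preg_replace() for cleanup. The pattern is an assumed, simplified one written for this example; it is not the Daring Fireball regex linked above and only recognises a handful of TLDs. The names URL_PATTERN, containsUrl and stripUrls are likewise hypothetical.

```php
<?php
// Assumed, simplified pattern for illustration: optional scheme, one or more
// dotted labels, a TLD from a short list, and an optional path.
const URL_PATTERN = '~\b(?:(?:https?|ftp)://)?(?:[a-z0-9-]+\.)+(?:com|org|net|edu|uk)\b(?:/\S*)?~i';

// Validation path: true when the text contains something URL-like.
function containsUrl(string $text): bool
{
    return preg_match(URL_PATTERN, $text) === 1;
}

// Cleanup path: replace every URL-like match with a placeholder.
function stripUrls(string $text, string $replacement = '[link removed]'): string
{
    // preg_replace() returns null on error; fall back to the original text.
    return preg_replace(URL_PATTERN, $replacement, $text) ?? $text;
}

$description = 'Save money and come direct to www.mywebsite.co.uk/shop today';
if (containsUrl($description)) {
    // Either reject the form at this point, or store the cleaned text instead.
    $description = stripUrls($description);
}
echo $description, PHP_EOL;
```

Rejecting the form tells the user why the description was refused, while silent replacement keeps the submission flowing; which one fits depends on how strict the site wants to be.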
