I almost have my regular expression down for skimming HTML pages, but I have run into two issues that I need to squash before I can proceed. I need to be able to match both an empty value and a lone slash (each followed directly by the closing quote), but I have exhausted my ability to see what I'm doing wrong. Could someone help me with the final bit?

$pathspec='in-front';

$subjects = array(
    '<base href="http://foo.com/images/" target="_blank">', # no changes              (correct)
    '<base href="/" target="_blank">',                      # '/in-front/'            (fails)
    '<a href="https://foo.com/images/">Foo</a>',            # no changes              (correct)
    '<a href="">Foo</a>',                                   # '/in-front/'            (fails)
    '<img src="bar/foo.png" />',                            # no changes              (correct)
    '<img src="/bar/foo.png" />',                           # '/in-front/bar/foo.png' (correct)
);


foreach ($subjects as $subject) {
    echo preg_replace( '/(href|src)=["\']?\/(?!\/)([^"\'>]+)["\']?/', "$1='/$pathspec/$2'", $subject ) . "\n";
}

die;

The expected output for each input is in the comments. Thank you.

  • @php_nub_qq: Huh what? Commented Dec 16, 2013 at 21:12
  • So basically what you want is to add $pathspec to any empty or root href attribute, yes? Commented Dec 16, 2013 at 21:14
  • @php_nub_qq close, read the regex carefully, and also the expected outputs along with their inputs. Commented Dec 16, 2013 at 21:18

2 Answers

See if this works for you

preg_replace('#(href|src)=["\'](?:/|/(?!\/)(\S+?)|)["\']#',"$1='/$pathspec/$2'",$subject)
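To check the pattern against the inputs from the question, here is a minimal sketch. The helper name `prefix_root_urls` and the extra `//cdn.example.com` subject are assumptions added for illustration; the pattern itself is the one above. Note that the replacement rewrites the attribute with single quotes, as in the question's own replacement string.

```php
<?php
// Hypothetical helper wrapping the pattern above; the name is illustrative only.
function prefix_root_urls(string $subject, string $pathspec): string
{
    // Between the quotes the pattern accepts one of three things: a lone "/",
    // a root-relative path (a "/" not followed by a second "/"), or nothing.
    $pattern = '#(href|src)=["\'](?:/|/(?!\/)(\S+?)|)["\']#';
    return preg_replace($pattern, "$1='/$pathspec/$2'", $subject);
}

$subjects = array(
    '<base href="http://foo.com/images/" target="_blank">',
    '<base href="/" target="_blank">',
    '<a href="https://foo.com/images/">Foo</a>',
    '<a href="">Foo</a>',
    '<img src="bar/foo.png" />',
    '<img src="/bar/foo.png" />',
    '<script src="//cdn.example.com/x.js"></script>', // assumed extra case
);

foreach ($subjects as $subject) {
    echo prefix_root_urls($subject, 'in-front'), "\n";
}
// Prints:
// <base href="http://foo.com/images/" target="_blank">
// <base href='/in-front/' target="_blank">
// <a href="https://foo.com/images/">Foo</a>
// <a href='/in-front/'>Foo</a>
// <img src="bar/foo.png" />
// <img src='/in-front/bar/foo.png' />
// <script src="//cdn.example.com/x.js"></script>
```

The alternation is tried left to right, so `"/"` is handled by the first branch, paths such as `/bar/foo.png` by the second, and `""` by the empty third branch. A value starting with `//` fails all three branches and is left untouched.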

5 Comments

FWIW, I think the original regex avoided matching // within the quotes, whereas this one accepts it.
@PeterAlfvin I edited his answer to avoid protocol-relative URLs; waiting for him to accept the change.
The pattern will only match empty URLs or URLs starting with a forward slash, so how could a double slash cause interference?
Double slashes denote 'protocol-relative URLs': //google.com resolves to either https://google.com or http://google.com, depending on whether the referring page was served over SSL. blog.servertastic.com/… Something to keep in mind when handling other people's data ;)
@ehime that's something I wasn't aware of. Could you suggest the edit again? I automatically assumed it was related to the unknown modifier error and rejected it without even reading.

You can use this pattern:

$pattern = '~\b(?:href|src)\s*=\s*(["\']?+)\K(?:/|(?=[\s>]|\1))~i';
$replacement = "/$pathspec/";

$result = preg_replace($pattern, $replacement, $subject);
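
The comments on the other answer raise protocol-relative URLs such as `//google.com`. As written, the `/` branch of this pattern would also match the first slash of such a URL. If those should be left alone, one possible variation (an assumption for illustration, not the answerer's pattern) adds a `(?!/)` lookahead after the slash. The helper name `prefix_with_guard` and the `cdn.example.com` host are hypothetical, and the sketch is only exercised on quoted attribute values.

```php
<?php
// Hypothetical helper; the name and the (?!/) guard are illustrative additions.
function prefix_with_guard(string $subject, string $pathspec): string
{
    // \K drops the attribute name and opening quote from the reported match,
    // so only the leading "/" (or the empty spot of an empty value) is replaced.
    // (?!/) rejects a second slash, which skips protocol-relative URLs.
    $pattern = '~\b(?:href|src)\s*=\s*(["\']?+)\K(?:/(?!/)|(?=[\s>]|\1))~i';
    return preg_replace($pattern, "/$pathspec/", $subject);
}

echo prefix_with_guard('<base href="/" target="_blank">', 'in-front'), "\n";
// <base href="/in-front/" target="_blank">
echo prefix_with_guard('<a href="">Foo</a>', 'in-front'), "\n";
// <a href="/in-front/">Foo</a>
echo prefix_with_guard('<img src="/bar/foo.png" />', 'in-front'), "\n";
// <img src="/in-front/bar/foo.png" />
echo prefix_with_guard('<script src="//cdn.example.com/x.js"></script>', 'in-front'), "\n";
// <script src="//cdn.example.com/x.js"></script>
```

Because only the text after `\K` is replaced, the original quote style of each attribute is preserved.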

9 Comments

This pattern does not correctly avoid protocols, and matches everything: pastebin.com/Sfm4004w it also needs an escape in (["']?)
@ehime: sorry, I forgot the +.
Great late answer, I've already accepted but plus one for it, works great
@CasmirEtHippolyte one case where yours does not work is JavaScript such as ga.src = ('' == document.location.protocol: since the pattern does not look behind for whitespace, it replaces that too =(
@ehime: OK, look at this: pastebin.com/PdMsq0qL. I have seen your code but not tested it. Note, however, that preg_replace can deal with pattern/replacement arrays, so using array_values and a for loop is unnecessary. There is also no need to escape the slash in <\/script>, since the delimiter is ~.