0

I am scraping a site and getting this:

<input type="BUTTON" value="Geographic Footprint" name="GEO_FOOTPRINT" onclick="return OpenModalDialog('https://mspfast.elavon.com/Symphony/client/client.do?uid=0XrHleUX5MudUYVwwsGDYCl&novaid=5418812&readonly=Y&context=BOARDING&defaultRoute=GeographicFootprint')">

What I want is to just grab the uid: 0XrHleUX5MudUYVwwsGDYCl

I am quite new to regex and don't really understand how it works.

I've tried doing:

'/value="Geographic Footprint" name="GEO_FOOTPRINT" onclick="return OpenModalDialog(\'https://mspfast.elavon.com/Symphony/client/client.do?uid=([a-zA-Z0-9]+)\&/'

as the regex, but it does not work. I get the error: Unknown modifier '/'

  • 2
    "I am quite new to regex and don't really understand how it works" and yet you are trying to use it instead of using an HTML parser? Commented Dec 2, 2015 at 17:56
  • 1
    @PeeHaa if someone is not familiar with regex, do you think they would know when to use it or an HTML parser (which they no doubt are not familiar with either)? Commented Dec 2, 2015 at 17:58
  • 2
    You forgot to escape the / in the url... you should probably learn more about regexes before you try to parse html AND javascript with them simultaneously. Commented Dec 2, 2015 at 17:58
  • 1
    What makes this node unique? Which attribute, value? I could help with a DOM example. Commented Dec 2, 2015 at 18:21
  • 2
    @stribizhev this is the only input with the name geo_footprint Commented Dec 2, 2015 at 18:21

2 Answers

1

Here is a way to access the only element whose name attribute has the value GEO_FOOTPRINT:

$html = '<body><input type="BUTTON" value="Geographic Footprint" name="GEO_FOOTPRINT" onclick="return OpenModalDialog(\'https://mspfast.elavon.com/Symphony/client/client.do?uid=0XrHleUX5MudUYVwwsGDYCl&novaid=5418812&readonly=Y&context=BOARDING&defaultRoute=GeographicFootprint\')"></body>';
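// Collect libxml parse warnings internally instead of printing them.
libxml_use_internal_errors(true);
$dom = new DOMDocument('1.0', 'UTF-8');
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

// Select the input by its name attribute and read its onclick text.
$xpath = new DOMXPath($dom);
$link = $xpath->query('//input[@name="GEO_FOOTPRINT"]')->item(0);
$val = $link !== null ? $link->getAttribute('onclick') : '';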

Once we have the text of the onclick attribute, there are several ways to get the uid value. Here is a regex-based one:

// Only read $m[1] when the pattern actually matched.
if (preg_match('~[?&]uid=([^&\s]+)~', $val, $m)) {
    echo $m[1]; // 0XrHleUX5MudUYVwwsGDYCl
}

The regex [?&]uid=([^&\s]+) matches ? or &, then the literal text uid=, and then captures into Group 1 one or more characters other than & or whitespace (\s), so the match stops before the next query parameter. Because the pattern uses ~ as its delimiter, any / characters in it need no escaping.
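As for the "Unknown modifier '/'" error in the question: with / as the delimiter, the first unescaped / inside https:// ends the pattern, and the next / is then read as a modifier. A minimal sketch of one way around that, assuming you still want to match the literal URL prefix, is to pick another delimiter and let preg_quote() escape the literal part. The variable names here ($onclick, $prefix, $pattern) are illustrative only:

```php
<?php
// Assumption: $onclick holds the onclick attribute text from the question.
$onclick = "return OpenModalDialog('https://mspfast.elavon.com/Symphony/client/client.do?uid=0XrHleUX5MudUYVwwsGDYCl&novaid=5418812&readonly=Y&context=BOARDING&defaultRoute=GeographicFootprint')";

// preg_quote() escapes regex metacharacters such as ( . ? in the literal
// prefix; its second argument names the delimiter so that is escaped too.
$prefix  = preg_quote("OpenModalDialog('https://mspfast.elavon.com/Symphony/client/client.do?uid=", '~');

// With ~ as the delimiter, the slashes in the URL are ordinary characters.
$pattern = '~' . $prefix . '([A-Za-z0-9]+)~';

$uid = preg_match($pattern, $onclick, $m) === 1 ? $m[1] : null;
echo $uid, PHP_EOL;
```

Hard-coding the whole prefix makes the pattern brittle if the host or path changes, so the shorter [?&]uid= pattern is probably the more practical choice.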

Other patterns are possible (for example, you may add OpenModalDialog\(\'http\S*? at the beginning of the pattern to make it more specific), or you could use string split / substr functions instead.
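As one non-regex way to read the parameter, the sketch below pulls the quoted URL out of the JavaScript call and then lets PHP's own parse_url() and parse_str() split the query string. It assumes the URL is wrapped in single quotes and contains no single quote itself; the names $params and $uid are illustrative:

```php
<?php
// Assumption: $val holds the onclick attribute text from the question.
$val = "return OpenModalDialog('https://mspfast.elavon.com/Symphony/client/client.do?uid=0XrHleUX5MudUYVwwsGDYCl&novaid=5418812&readonly=Y&context=BOARDING&defaultRoute=GeographicFootprint')";

$uid = null;

// Capture whatever sits between the single quotes of OpenModalDialog('...').
if (preg_match("~OpenModalDialog\\('([^']+)'\\)~", $val, $m)) {
    // Extract just the query string, then decode it into an array.
    $query = parse_url($m[1], PHP_URL_QUERY);
    parse_str((string) $query, $params);
    $uid = $params['uid'] ?? null;
}

echo $uid, PHP_EOL;
```

This keeps the regex small and leaves the query-string handling, including URL decoding, to the built-in functions.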


0

Here is an example with a named group:

$str = "<input type=\"BUTTON\" value=\"Geographic Footprint\" name=\"GEO_FOOTPRINT\" onclick=\"return OpenModalDialog('https://mspfast.elavon.com/Symphony/client/client.do?uid=0XrHleUX5MudUYVwwsGDYCl&novaid=5418812&readonly=Y&context=BOARDING&defaultRoute=GeographicFootprint')\">";
$regex = '/uid=(?P<uid>[^&]+)/';
// search for uid literally, afterwards match everything except an ampersand 
// and capture it in a group called "uid"

preg_match_all($regex, $str, $matches);
$uid = $matches["uid"][0];
// uid: 0XrHleUX5MudUYVwwsGDYCl

While this might work for this particular example, it's almost always better to use a parser (e.g. SimpleXML) for these tasks.
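A rough sketch of the parser route with SimpleXML, under the assumption that the markup is first loaded through DOMDocument (SimpleXML itself expects well-formed XML, which scraped HTML often is not) and then handed over with simplexml_import_dom(). The variable names are illustrative:

```php
<?php
// Assumption: $html is the scraped markup from the question.
$html = '<input type="BUTTON" value="Geographic Footprint" name="GEO_FOOTPRINT" onclick="return OpenModalDialog(\'https://mspfast.elavon.com/Symphony/client/client.do?uid=0XrHleUX5MudUYVwwsGDYCl&novaid=5418812&readonly=Y&context=BOARDING&defaultRoute=GeographicFootprint\')">';

// Let the tolerant HTML parser build the tree; keep warnings quiet.
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);

// Convert the DOM tree to SimpleXML and locate the input by name.
$sx    = simplexml_import_dom($dom);
$nodes = $sx->xpath('//input[@name="GEO_FOOTPRINT"]');

$uid = null;
// Run the named-group pattern on the attribute text only, not on the whole page.
if ($nodes && preg_match('/[?&]uid=(?P<uid>[^&\s\']+)/', (string) $nodes[0]['onclick'], $m)) {
    $uid = $m['uid'];
}

echo $uid, PHP_EOL;
```

Restricting the regex to a single attribute value means unrelated uid= strings elsewhere in the page cannot produce a false match.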
