1

I have written this regexp: <(a*)\b[^>]*>.*?</\1>

and tested it on this regexp testing site: http://gskinner.com/RegExr/?2tntr

The point of the regexp is to go through a site's HTML and find all of the links. It should then return these in an array for me to manipulate.

On the regexp testing site it works perfectly, but when put in action with JavaScript on my site it returns null.

JavaScript looks like this:

var data = $('#mainDivOnMiddleOfPage').html();

var pattern = "<(a*).*href=.*>.*</a>";
var modi = "g";

var patt = new RegExp(pattern, modi);
var result = patt.exec(data);

jQuery gets the content of the page. This is tested and verified.

The question is: why does this return null in JavaScript, when it returns the expected matches in the regexp tester?

  • Don't use regex for HTML parsing. Commented Jun 29, 2012 at 21:55
  • If you're already using jQuery, why not use $("#mainDivOnMiddleOfPage a")? Commented Jun 29, 2012 at 21:56
  • I am using $.ajax to get the HTML content, which means I can't (I think) use jQuery to get the A elements... but it's a good idea if I could =) Commented Jun 29, 2012 at 21:58
  • You can. It's a matter of where they are and how you have them. You have a string? Are you appending them somewhere? You can do $(HTML_STRING).children("a"), for instance, or otherwise treat the returned object as queryable HTML. Commented Jun 29, 2012 at 22:06

6 Answers

1

All <a> links:

<a[^>]*?\bhref=['\"](.*?)['\"]

Absolute links only (starting with http):

<a[^>]*?\bhref=['\"](http.*?)['\"]

JavaScript code:

var html = '<a href="test.html">';
// match() without the g flag returns the capture groups; m[1] is the href value
var m = html.match(/<a[^>]*?\bhref=['"](.*?)['"]/);
console.log(m[1]);

See and test the code here.


4 Comments

(These are local links, so http can't be in the search phrase.)
JS does support \b. It's failing because, if you prepare your pattern as a string, rather than as a RegExp literal (the former is needed really only if you need to reference variables in the pattern), you have to double-escape special characters. Change \b to \\b and it will work. It would also be sensible to force the closing quote of the href attribute to match the opening one (i.e. double vs. single). Can be achieved with a back-reference. The revised pattern, as a literal rather than a string, would be /<a[^>]*?\bhref=('|")(.*?)\1/
@BjørnØyvindHalvorsen - I have updated my answer with code and test link
You'll also need the g global flag to match all links - this will match just the first.
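
The escaping point made in the comments above can be checked with a small sketch. The sample markup and the variable names are made up for illustration and are not from the question; the pattern is the back-reference version suggested in the comment.

```javascript
// Hypothetical sample markup, used only for this illustration.
var sample = '<a class="nav" href="page.html">Page</a>';

// RegExp literal: backslashes are written once.
var literal = /<a[^>]*?\bhref=('|")(.*?)\1/;

// Pattern string with a single backslash: "\b" is read as a backspace
// character before the RegExp constructor ever sees it.
var singleEscaped = new RegExp("<a[^>]*?\bhref=('|\")(.*?)");

// Pattern string with doubled backslashes: equivalent to the literal above.
var doubleEscaped = new RegExp("<a[^>]*?\\bhref=('|\")(.*?)\\1");

var literalMatch = literal.exec(sample);       // href value is in group 2
var singleMatch = singleEscaped.exec(sample);  // null: no backspace in sample
var doubleMatch = doubleEscaped.exec(sample);  // href value is in group 2
```

Using a literal avoids the issue entirely; the string form is only needed when the pattern is built from variables.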
1

I use the following code to do the same thing and it works for me. Try it out:

var data = document.getElementById('mainDivOnMiddleOfPage').textContent;

var result = data.match(/<(a*).*href=.*>.*<\/a>/);

6 Comments

data.match is not a function... Might I need some external libraries?
Sorry... match doesn't exist in JS. Its counterpart is exec.
Just added a fix to the answer: you need to get the textContent. Try it now.
Bear in mind textContent is a relatively new property. Use innerHTML if you need to support IE <= 8.
1

Going to go ahead and post this here, since I think it's what you want -- it is not a RegEx solution, however.

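$(function(){
    $.ajax({
        url: "test.htm",
        success: function(data){
            // Parse the returned HTML string and collect the <a> elements
            // found inside it into a plain array
            var array_of_links = $.makeArray($("a",data));
            // do your stuff here
        }
    });
});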


1

I'm conscious an answer has been chosen. However it's worth mentioning that the current REGEX solutions match the tags but not the actual HREFs in isolation.

This is where JavaScript's String.match() falls down: when the global g flag is specified, it returns only the full matches and drops the captured sub-groups.

One way round this is to exploit the REGEX replacement callback. This will get just the link HREFs, not the tags.

var html = document.body.innerHTML,
    links = [];
html.replace(/<a[^>]*?href=('|")(.*?)\1/gi, function($0, $1, $2) {
    links.push($2);
});
//links is now an array of hrefs

It also uses a back-reference to close the href attribute, i.e. making sure both opening and closing quote are single or double, not mixed.
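
Calling exec() repeatedly on a regex with the g flag is another way to reach the sub-groups. The following is a minimal sketch, with a made-up sample string standing in for the page HTML:

```javascript
// Hypothetical sample markup; in a page this could be document.body.innerHTML.
var html = '<a href="a.html">A</a><a class="x" href=\'b.html\'>B</a>';

var re = /<a[^>]*?href=('|")(.*?)\1/gi;
var links = [];
var match;

// With the g flag, each exec() call resumes from re.lastIndex and returns
// the capture groups of the next match, or null when there are no more.
while ((match = re.exec(html)) !== null) {
    links.push(match[2]); // group 2 is the href value
}
```

The loop has to reuse the same regex object, because the position of the next search is stored in its lastIndex property.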

Sidenote: as others have mentioned, where possible, you'd want to DOM this rather than REGEX.


1

"The point of the regexp is to go through a sites HTML and find all of the links. It should then return these in an Array for me to manipulate."

I won't add another regex answer, but just want to point out that if you have hold of the document (not just the HTML), then it's easier to walk through the document.links collection. It contains all <a href=""> elements, but also all <area> elements:

for (var link, links = document.links, n = links.length, i=0; i<n; i++){
    link = links[i];
    switch (link.tagName){
        case "A":
            //do something with the link
            break;
        case "AREA":
            //do something with the area.
            break;
    } 
} 


0

Your problem is that you are not compiling your regex:

patt.compile();

You have to call it before using with the exec() method.

2 Comments

compile() is deprecated and does nothing beyond what the RegExp constructor already does. Thus, it is not necessary to call it before running exec().
It's necessary for some browsers like IE, but not in all cases. Do a test yourself in IE with the case he pointed out.
