I'm trying to get some page details (page title, images on the page, etc.) of an arbitrarily entered URL/page. I have a back-end proxy script that I use via an ajax GET in order to return the full HTML of the remote page. Once I get the ajax response back, I'm trying to run several jQuery selectors on it to extract the page details. Here's the general idea:

$.ajax({
        type: "GET",
        url: base_url + "/Services/Proxy.aspx?url=" + url,
        success: function (data) {
            //data is now the full html string contained at the url

            //generally works for images
            var potential_images = $("img", data); 

            //doesn't seem to work even if there is a title in the HTML string
            var name = $(data).filter("title").first().text();

            var description = $(data).filter("meta[name='description']").attr("content"); 

        }
    });

Sometimes using $("selector", data) seems to work while other times $(data).filter("selector") seems to work. Sometimes, neither works. When I just inspect the contents of $(data), it seems that some nodes make it through, but some just disappear. Does anyone know a consistent way to run selectors on a full HTML string?

1 Answer

Your question is somewhat vague, especially about which input causes which code to fail, and how. Malformed HTML could be breaking the parse, but I can only guess.

That said, your best bet is to work with $(data) rather than data:

$.ajax({
    type: "GET",
    url: base_url + "/Services/Proxy.aspx?url=" + url,
    success: function(data) {
        //data is the full html string contained at the url; parse it once and reuse the result
        var $data = $(data);

        //the context form is equivalent to $data.find("img"): it matches descendants of the parsed nodes only
        var potential_images = $("img", $data);

        //filter() tests only the top-level parsed nodes, not their descendants
        var name = $data.filter("title").first().text();

        var description = $data.filter("meta[name='description']").attr("content");
    }
});
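
If that still misses nodes, a likely cause is that $(data) does not keep a full document intact: as far as I know, the html, head and body wrappers are typically dropped, so title and meta tags end up as top-level nodes (where only filter() sees them) while most img tags are nested (where only find() sees them). One alternative is to parse the string into a detached document with the browser's DOMParser and query that. This is only a sketch: parseHtmlDocument and extractPageDetails are names I made up, and it assumes a browser that supports the "text/html" mode of DOMParser.

```javascript
// Hypothetical helper: parse a full HTML string into a detached Document.
// Browser-only; assumes DOMParser supports the "text/html" type.
function parseHtmlDocument(html) {
    return new DOMParser().parseFromString(html, "text/html");
}

// Hypothetical helper: pull the page details out of a parsed Document.
function extractPageDetails(doc) {
    var meta = doc.querySelector("meta[name='description']");

    // Collect the raw src attribute of every <img>, wherever it is nested.
    var images = Array.prototype.map.call(doc.querySelectorAll("img"), function (img) {
        return img.getAttribute("src");
    });

    return {
        name: doc.title,
        description: meta ? meta.getAttribute("content") : null,
        images: images
    };
}

// Inside the success callback this would be used as:
// var details = extractPageDetails(parseHtmlDocument(data));
```

Because head and body survive in the parsed document, the same selector works no matter where a node sits, so there is no need to guess between filter() and find(). Note that the src values are returned as written in the remote page, so relative URLs would still need to be resolved against the original url.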

1 Comment

Unfortunately the input could potentially be the HTML of any arbitrary page. I've tried with numerous popular websites including cnn.com, twitter.com, and espn.go.com -- all of which seem to have the same problems, especially with extracting the title.
