I have this webpage: http://www.westminster.ac.uk/schools/computing/undergraduate . I'm using hpple to retrieve data (I've just started learning about it). I specifically want to retrieve the hrefs from the main page. How can I do this?

I have this line, "NSArray *elements = [xpathParser search:@"//a"];", which retrieves all of the href links within the page. However, how can I retrieve just the ones in the main content, e.g. "BSc Honors Business Information Systems"? What's the syntax for it?

  • What do you mean by main content? Can you provide a sample? Commented Aug 17, 2011 at 14:34

1 Answer

It looks like all of the "main content" stuff is found underneath elements with id attributes like "content_div_XXXX" where XXXX is some randomly generated sequence. You might be able to get at what you want using an XPath that looks something like:

//div[starts-with(@id,'content_div')]//a

You should be able to get something like this working, although you may need to try it out and tweak it to select exactly the elements you want. The W3Schools XPath page has a good set of XPath tutorials.
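As a minimal sketch of how that XPath might be used with hpple to collect the href values: the helper name MainContentLinks is hypothetical, the htmlData parameter is assumed to hold the already-downloaded page, and `search:` is used only because it matches the call in the question (newer hpple releases, as far as I know, expose the same query as `searchWithXPathQuery:`).

```objc
#import <Foundation/Foundation.h>
#import "TFHpple.h"

// Hypothetical helper: returns the href of every <a> found under a <div>
// whose id starts with "content_div".
static NSArray *MainContentLinks(NSData *htmlData) {
    TFHpple *xpathParser = [[TFHpple alloc] initWithHTMLData:htmlData];

    // `search:` matches the question's hpple version (assumption); newer
    // releases name this method `searchWithXPathQuery:`.
    NSArray *elements =
        [xpathParser search:@"//div[starts-with(@id,'content_div')]//a"];

    NSMutableArray *links = [NSMutableArray array];
    for (TFHppleElement *element in elements) {
        // objectForKey: looks up an attribute on the matched element.
        NSString *href = [element objectForKey:@"href"];
        if (href != nil) {
            // Anchors without an href attribute are skipped.
            [links addObject:href];
        }
    }
    return links;
}
```

The hrefs come back exactly as written in the page, so relative links would still need to be resolved against the page URL.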

4 Comments

That actually works pretty well, but I do have some questions. From what I gather from the tutorial, whatever is in '[]' is used to filter data. So in this case, we're looking for a div that has an id attribute containing the text 'content_div'?
The above XPath selects all <a> elements that have a <div> ancestor with an "id" attribute that starts with the string "content_div". The bracket notation is how you implement conditional checks. The '@' syntax is how you reference attributes. If you have an additional question, please update the post.
Thanks. Assuming that I want to retrieve all the text from this page (the page content, in text format) westminster.ac.uk/schools/computing/undergraduate/… would the syntax be this: div[starts-with(@id,'content_div')], without the hyperlink //a? Also, when you look at the format of the webpage (to see the elements etc.), how do you do it? Do you just use Firefox/Mozilla etc. to look at the raw source code, or is there a way to see it in XML format? Thanks.
If you want to select the text within the <a> element rather than the <a> element itself, use //div[starts-with(@id,'content_div')]//a/text(). To see the elements of the HTML for this web page, I typically just view source from my browser. However, if you want more robust ways to analyze the HTML, most browsers have built-in developer tools or plugins to help with that.
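A rough sketch of the text() variant with hpple, under some assumptions: the helper name MainContentLinkTitles is hypothetical, htmlData is assumed to be the downloaded page, `search:` follows the question's hpple version, and it assumes hpple returns each matched text node as a TFHppleElement whose `content` holds the string, which is worth verifying against the version you have.

```objc
#import <Foundation/Foundation.h>
#import "TFHpple.h"

// Hypothetical helper: returns the visible text of each link under a <div>
// whose id starts with "content_div".
static NSArray *MainContentLinkTitles(NSData *htmlData) {
    TFHpple *xpathParser = [[TFHpple alloc] initWithHTMLData:htmlData];

    // text() selects the text nodes inside each matching <a>.
    NSArray *textNodes =
        [xpathParser search:@"//div[starts-with(@id,'content_div')]//a/text()"];

    NSCharacterSet *whitespace =
        [NSCharacterSet whitespaceAndNewlineCharacterSet];
    NSMutableArray *titles = [NSMutableArray array];
    for (TFHppleElement *node in textNodes) {
        // Assumption: `content` carries the string of a matched text node.
        NSString *text =
            [[node content] stringByTrimmingCharactersInSet:whitespace];
        if ([text length] > 0) {
            // Whitespace-only nodes (and nil content) are skipped.
            [titles addObject:text];
        }
    }
    return titles;
}
```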
