
I was trying to parse a page (Kaggle Competitions) with xpath on macOS, as described in another SO question:

curl "https://www.kaggle.com/competitions/search?SearchVisibility=AllCompetitions&ShowActive=true&ShowCompleted=true&ShowProspect=true&ShowOpenToAll=true&ShowPrivate=true&ShowLimited=true&DeadlineColumnSort=Descending" -o competitions.html
cat competitions.html | xpath '//*[@id="competitions-table"]/tbody/tr[205]/td[1]/div/a/@href'

That's just getting the href of a link in a table.

But instead of returning the value, xpath starts validating the HTML and returns errors like undefined entity at line 89, column 13, byte 2964.

Since man xpath doesn't exist and xpath --help prints nothing, I'm stuck. Also, many similar solutions relate to the xpath shipped with GNU distributions, not the one on macOS.

Is there a correct way of getting HTML elements via XPath in bash?

  • I checked the HTML source at Kaggle and it is not well-formed XML, so XPath will probably fail. The source is HTML, not XHTML. You would have to remove 'incomplete' tags like <br> (luckily only a few of them in that source) before processing the source as XML using XPath. Commented May 6, 2016 at 13:06
  • @zx485 Is there any way to ignore errors? I did xml sel --html -T -t -v on the same source and it returns like 20 errors. Would scrapy or lxml do better? Commented May 6, 2016 at 13:36
  • Not using XML! Line 86 of the source file contains a very malformed tag: <div id="header2-inside" class=>. I'm not sure what creates this, but it is unparsable by an XML parser. One way would be to repair the broken parts of this HTML source with sed. For example, you could replace the (incomplete) <br> tags in competitions.html with sed -e "s/<br>/<br \/>/g" competitions.html > competitions2.html. Then repeat that for the other "errors". After you have finished, you can process the resulting file with an XML parser and XPath (see the sketch after this comment list). Commented May 6, 2016 at 13:59
  • The central problem with processing HTML as XML is the incomplete tags. I'm not amused that this has not been fixed by the standard, that is, by making XHTML the default. Commented May 6, 2016 at 14:02
  • Also notice that there is no tbody in the file. This is added by the browser to create valid HTML. Commented May 6, 2016 at 14:36
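
A minimal sketch of the cleanup suggested in the comments above. The substitutions are only examples; other problems (like the empty class= attribute mentioned above) would need their own fixes, and this approach stays fragile:

# Make HTML void tags self-closing so an XML parser accepts them.
# Only <br> and <hr> are handled here; extend the list as errors appear.
sed -e 's,<br>,<br/>,g' \
    -e 's,<hr>,<hr/>,g' \
    competitions.html > competitions2.html

# Query the cleaned file; note there is no tbody in it (see the last comment above).
cat competitions2.html | xpath '//*[@id="competitions-table"]/tr[205]/td[1]/div/a/@href'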

1 Answer


Getting HTML elements via XPath in bash

from an HTML file (that is not valid XML)

One possibility is to use xsltproc (I hope it is available for macOS). xsltproc has an --html option to accept HTML as input, but then you need an XSLT stylesheet.

<xsl:stylesheet 
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <xsl:output method="text" /> 

  <xsl:template match="/*">
    <xsl:value-of select="//*[@id='competitions-table']/tr[205]/td[1]/div/a/@href" />
  </xsl:template>

</xsl:stylesheet>

Notice that the XPath has changed: there is no tbody in the input file. Call xsltproc:

xsltproc --html test.xsl competitions.html 2> /dev/null

The errors xsltproc reports about the malformed HTML are ignored (sent to /dev/null).

The output is: /c/R

To use a different XPath expression from the command line, you can start from an XSLT template and replace a __xpath__ placeholder.

E.g. the XSLT template:

<xsl:stylesheet 
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <xsl:output method="text" /> 
  <xsl:template match="/*">
    <xsl:value-of select="__xpath__" />
  </xsl:template>
</xsl:stylesheet>

And use e.g. sed for the replacement:

 sed -e "s,__xpath__,//*[@id='competitions-table']/tr[205]/td[1]/div/a/@href," test.xslt.tmpl > test.xsl
 xsltproc --html test.xsl competitions.html 2> /dev/null
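
If you do this often, the two steps can be wrapped in a small shell function. This is only a sketch under the same assumptions as above (the template file test.xslt.tmpl from the answer exists in the current directory); the function name xpath_html is an arbitrary choice, not part of any tool:

# Usage: xpath_html '<xpath-expression>' <html-file>
# Caveat: the comma is the sed delimiter, so the expression must not
# contain commas (or &, which sed treats specially in replacements).
xpath_html() {
  sed -e "s,__xpath__,$1," test.xslt.tmpl > test.xsl
  xsltproc --html test.xsl "$2" 2> /dev/null
}

xpath_html "//*[@id='competitions-table']/tr[205]/td[1]/div/a/@href" competitions.html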

1 Comment

Wow! That's a great answer. Works like a charm!
