I was trying to parse a page (Kaggle Competitions) with xpath on macOS, as described in another SO question:
curl "https://www.kaggle.com/competitions/search?SearchVisibility=AllCompetitions&ShowActive=true&ShowCompleted=true&ShowProspect=true&ShowOpenToAll=true&ShowPrivate=true&ShowLimited=true&DeadlineColumnSort=Descending" -o competitions.html
cat competitions.html | xpath '//*[@id="competitions-table"]/tbody/tr[205]/td[1]/div/a/@href'
That just gets the href of a link in a table.
But instead of returning the value, xpath starts validating the .html file and returns errors like: undefined entity at line 89, column 13, byte 2964.
Since man xpath doesn't exist and xpath --help prints nothing, I'm stuck. Also, many similar solutions deal with the xpath from GNU distributions, not the one shipped with macOS.
Is there a correct way to get HTML elements via XPath in bash?
Kaggle's page is not well-formed XML, therefore XPath will probably fail. The source is HTML and not XHTML. You would have to fix 'incomplete' tags like <br> (luckily only a few of them in that source) before processing the source as XML using XPath.

Follow-up: I ran xml sel --html -T -t -v on the same source and it returns something like 20 errors. Would scrapy or lxml do better?

The source also contains fragments like <div id="header2-inside" class=>. I'm not sure what creates this, but it is unparsable by an XML parser. One way would be to patch the broken parts of the HTML source with sed. For example, you could replace the incomplete <br> tags in competitions.html with

sed -e "s/<br>/<br \/>/g" competitions.html > competitions2.html

Then repeat that for the other "errors". After you have finished, you can process the resulting file with an XML parser and XPath.
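To tie it together, a minimal sketch of that cleanup could look like the following. The first sed expression is the one from the answer; the second, giving the empty class attribute a value, is my assumption about what that fragment should look like, and the undefined-entity errors will need further expressions of the same kind. Note that BSD/macOS sed requires the empty '' argument after -i.

# Self-close the bare <br> tags so an XML parser accepts them.
sed -e "s/<br>/<br \/>/g" competitions.html > competitions2.html

# Assumed fix for the empty class attribute quoted above.
sed -i '' -e 's/class=>/class="">/g' competitions2.html

# Re-run the original query against the cleaned-up file.
cat competitions2.html | xpath '//*[@id="competitions-table"]/tbody/tr[205]/td[1]/div/a/@href'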