I'm trying to fetch HTML tags and their attributes from a webpage with Linux command-line tools. Here's the concrete case:
The task: get all 'src' attributes of all 'script' tags of the website 'clojurescript.net'. This should happen with as little ceremony as possible, almost as simply as using grep to fetch some lines from a text file.
    curl -L clojurescript.net | [the toolchain in question "script @src"]
    http://ajax.googleapis.com/ajax/libs/jquery/1.8.3/jquery.min.js
    http://kanaka.github.io/cljs-bootstrap/web/jqconsole.min.js
    http://kanaka.github.io/cljs-bootstrap/web/jq_readline.js
    [...further results]
The tools I tried are hxnormalize/hxselect, tidy, and xmlstarlet. I couldn't get a reliable result with any of them. This task has always been straightforward with the libraries of several programming languages.
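For reference, the closest thing I can put together with the html-xml-utils pair is the sketch below (assuming the package is installed; flags taken from its man pages). As far as I can tell, hxselect can only print whole elements, or their content with -c, but never a single attribute, so a text-matching step is still needed at the end:

    # hxnormalize -x repairs the markup and emits XML-ish HTML;
    # hxselect matches the script elements by CSS selector;
    # grep/cut then scrape the attribute out as plain text.
    curl -sL clojurescript.net \
      | hxnormalize -x 2>/dev/null \
      | hxselect -i 'script' \
      | grep -o 'src="[^"]*"' \
      | cut -d'"' -f2

The final grep/cut step is exactly the kind of fragile scraping I'm trying to avoid.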
- So what's the state of the art of doing this in the CLI?
- Does it make sense to convert HTML to XML first, in order to have a cleaner tree representation? (See the sketch after this list.)
- HTML is often written with many syntactic mistakes; is there a standard approach, used by the common libraries, to correct/clean up this loose structure?
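Regarding the conversion question, a sketch of what that pipeline might look like with tidy feeding xmlstarlet (flags taken from their man pages; untested beyond this one page). Note that tidy puts the document into the XHTML default namespace, so the XPath has to match on local-name(), or the namespace has to be registered with xmlstarlet's -N option; that namespace dance is itself more ceremony than I'd like:

    # tidy repairs the tag soup and emits well-formed XHTML;
    # xmlstarlet then evaluates the XPath over it. local-name()
    # is needed because tidy adds the XHTML default namespace.
    curl -sL clojurescript.net \
      | tidy -q -asxml --show-warnings no 2>/dev/null \
      | xmlstarlet sel -t -m '//*[local-name()="script"][@src]' -v '@src' -n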
Using CSS selectors with the additional option of extracting only an attribute would be fine, but maybe XPath would be a better selection syntax for this.
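For the XPath style, libxml2's xmllint seems to get close to the one-liner I'm after, since its --html switch uses the forgiving HTML parser and no separate cleanup pass is needed. A sketch, assuming xmllint is installed; the grep/cut step is only there because --xpath prints the matched attribute nodes in the form src="..." rather than as bare values:

    # XPath directly over tag-soup HTML via libxml2's HTML parser
    curl -sL clojurescript.net \
      | xmllint --html --xpath '//script/@src' - 2>/dev/null \
      | grep -o 'src="[^"]*"' \
      | cut -d'"' -f2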
xmlstarlet can be used to convert HTML to XML, which can then be parsed.
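Roughly like this, keeping both the cleanup and the query inside one tool; unlike the tidy route, this adds no XHTML namespace, so the plain XPath works:

    # 'fo -H -R' parses the HTML leniently (--html --recover) and
    # re-emits it as XML; 'sel' then prints the src attribute of
    # every script element that has one, one per line.
    curl -sL clojurescript.net \
      | xmlstarlet fo -H -R 2>/dev/null \
      | xmlstarlet sel -t -m '//script[@src]' -v '@src' -n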