I'm trying to fetch HTML tags and their attributes from a webpage with Linux command-line tools. Here's the concrete case:
The task: get all 'src' attributes of all 'script' tags of the website 'clojurescript.net'. This should happen with as little ceremony as possible, almost as simply as using grep to fetch some lines from a text file.
    curl -L clojurescript.net | [the toolchain in question "script @src"]
    http://ajax.googleapis.com/ajax/libs/jquery/1.8.3/jquery.min.js
    http://kanaka.github.io/cljs-bootstrap/web/jqconsole.min.js
    http://kanaka.github.io/cljs-bootstrap/web/jq_readline.js
    [...further results]
The tools I tried are hxnormalize/hxselect, tidy, and xmlstarlet. I couldn't get a reliable result with any of them. This task has always been straightforward with the libraries of several programming languages.
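For reference, the closest thing I can put together with the html-xml-utils pair is the sketch below (assuming the package is installed; flags taken from its man pages). As far as I can tell, hxselect can only print whole elements, or their content with -c, but never a single attribute, so a text-matching step is still needed at the end:

    # hxnormalize -x repairs the markup and emits XML-ish HTML;
    # hxselect matches the script elements by CSS selector;
    # grep/cut then scrape the attribute out as plain text.
    curl -sL clojurescript.net \
      | hxnormalize -x 2>/dev/null \
      | hxselect -i 'script' \
      | grep -o 'src="[^"]*"' \
      | cut -d'"' -f2

The final grep/cut step is exactly the kind of fragile scraping I'm trying to avoid.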
- So what's the state of the art of doing this in the CLI?
- Does it make sense to convert HTML to XML first, in order to have a cleaner tree representation? (See the sketch after this list.)
- HTML is often written with many syntactic mistakes; is there a standard approach, used by the common libraries, to correct/clean up this loose structure?
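Regarding the conversion question, a sketch of what that pipeline might look like with tidy feeding xmlstarlet (flags taken from their man pages; untested beyond this one page). Note that tidy puts the document into the XHTML default namespace, so the XPath has to match on local-name(), or the namespace has to be registered with xmlstarlet's -N option; that namespace dance is itself more ceremony than I'd like:

    # tidy repairs the tag soup and emits well-formed XHTML;
    # xmlstarlet then evaluates the XPath over it. local-name()
    # is needed because tidy adds the XHTML default namespace.
    curl -sL clojurescript.net \
      | tidy -q -asxml --show-warnings no 2>/dev/null \
      | xmlstarlet sel -t -m '//*[local-name()="script"][@src]' -v '@src' -n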
Using CSS selectors with the additional option of extracting only an attribute would be fine, but maybe XPath would be a better selection syntax for this.
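For the XPath style, libxml2's xmllint seems to get close to the one-liner I'm after, since its --html switch uses the forgiving HTML parser and no separate cleanup pass is needed. A sketch, assuming xmllint is installed; the grep/cut step is only there because --xpath prints the matched attribute nodes in the form src="..." rather than as bare values:

    # XPath directly over tag-soup HTML via libxml2's HTML parser
    curl -sL clojurescript.net \
      | xmllint --html --xpath '//script/@src' - 2>/dev/null \
      | grep -o 'src="[^"]*"' \
      | cut -d'"' -f2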
xmlstarlet can be used to convert HTML to XML, which can then be parsed.
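Roughly like this, keeping both the cleanup and the query inside one tool; unlike the tidy route, this adds no XHTML namespace, so the plain XPath works:

    # 'fo -H -R' parses the HTML leniently (--html --recover) and
    # re-emits it as XML; 'sel' then prints the src attribute of
    # every script element that has one, one per line.
    curl -sL clojurescript.net \
      | xmlstarlet fo -H -R 2>/dev/null \
      | xmlstarlet sel -t -m '//script[@src]' -v '@src' -n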