ahobsonsayers/html-table-parser-python3

 
 

html-table-parser-python3.5+

This module consists of just one small class. Its purpose is to parse HTML tables without the help of external modules. Everything it uses is part of the Python 3 standard library.

Installation

pip install html-table-parser-python3

How to use

Example Usage:

import urllib.request
from pprint import pprint
from html_table_parser.parser import HTMLTableParser


def url_get_contents(url):
    """ Opens a URL and returns its binary contents (HTTP response body) """
    req = urllib.request.Request(url=url)
    # Close the connection as soon as the body has been read
    with urllib.request.urlopen(req) as f:
        return f.read()


def main():
    url = 'http://www.twitter.com'
    xhtml = url_get_contents(url).decode('utf-8')

    p = HTMLTableParser()
    p.feed(xhtml)
    pprint(p.tables)


if __name__ == '__main__':
    main()

The parser collects its results in p.tables, a nested list of tables containing rows containing cells as strings. Tags inside cells are stripped and their text content is joined. The console output for parsing all tables on the Twitter home page looks like this:

>>>
[[['', 'Anmelden']],
 [['Land', 'Code', 'Für Kunden von'],
  ['Vereinigte Staaten', '40404', '(beliebig)'],
  ['Kanada', '21212', '(beliebig)'],
  ...
  ['3424486444', 'Vodafone'],
  ['Zeige SMS-Kurzwahlen für andere Länder']]]
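
The nested structure and the tag stripping described above can be illustrated with a minimal sketch built on the standard library's html.parser. TinyTableParser is a hypothetical name used only for this illustration; it is not the package's HTMLTableParser, and joining text fragments with a single space is an assumption. The sketch also does not handle nested tables.

```python
from html.parser import HTMLParser


class TinyTableParser(HTMLParser):
    """Hypothetical stand-in for illustration; not the package's HTMLTableParser."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.tables = []   # tables -> rows -> cell strings
        self._row = None   # row currently being built, if any
        self._cell = None  # text fragments of the current cell, if any

    def handle_starttag(self, tag, attrs):
        if tag == 'table':
            self.tables.append([])
        elif tag == 'tr' and self.tables:
            self._row = []
        elif tag in ('td', 'th') and self._row is not None:
            self._cell = []
        # Any other tag (e.g. <b>) is ignored, which strips it from the cell

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data.strip())

    def handle_endtag(self, tag):
        if tag in ('td', 'th') and self._cell is not None:
            # Assumption: non-empty fragments are joined with a single space
            self._row.append(' '.join(part for part in self._cell if part))
            self._cell = None
        elif tag == 'tr' and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None


sample = ('<table><tr><th>Country</th><th>Code</th></tr>'
          '<tr><td><b>Canada</b></td><td>21212</td></tr></table>')
parser = TinyTableParser()
parser.feed(sample)
tables = parser.tables
```

Here tables holds one table with two rows, and the <b> tag around "Canada" is gone while its text is kept.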

CLI

There is also a command line interface which you can use directly to generate a CSV:

./html_table_converter -u http://web.archive.org/web/20180524092138/http://metal-train.de/index.php/fahrplan.html -o metaltrain
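
If you would rather produce CSV from your own script, the parsed rows can be written with the standard csv module. This is a minimal sketch, not the converter's actual code: table_to_csv is a hypothetical helper, and the one-table-per-output layout is an assumption.

```python
import csv
import io


def table_to_csv(rows):
    """Serialize one table (a list of rows of cell strings) as CSV text."""
    buffer = io.StringIO()
    # A fixed line terminator keeps the output identical across platforms
    writer = csv.writer(buffer, lineterminator='\n')
    writer.writerows(rows)
    return buffer.getvalue()


# One table in the same shape as the entries of p.tables
table = [['Land', 'Code', 'Für Kunden von'],
         ['Kanada', '21212', '(beliebig)']]
csv_text = table_to_csv(table)
```

To write a file instead, pass a file opened with newline='' and encoding='utf-8' to csv.writer in place of the StringIO buffer.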

Credit

All credit goes to Josua Schmid (schmijos). This is all his work; I just uploaded it to PyPI. The original repository can be found at:

https://github.com/schmijos/html-table-parser-python3

License

GNU GPL v3
