How to parse text from an HTML file

Question

import urllib2
import nltk
from HTMLParser import HTMLParser
from bs4 import BeautifulSoup




l = """<TR><TD><small style=font-family:courier> >M. tuberculosis H37Rv|Rv3676|crp<br />VDEILARAGIFQGVEPSAIAALTKQLQPVDFPRGHTVFAEGEPGDRLYIIISGKVKIGRR<br />APDGRENLLTIMGPSDMFGELSIFDPGPRTSSATTITEVRAVSMDRDALRSWIADRPEIS<br />EQLLRVLARRLRRTNNNLADLIFTDVPGRVAKQLLQLAQRFGTQEGGALRVTHDLTQEEI<br />AQLVGASRETVNKALADFAHRGWIRLEGKSVLISDSERLARRAR<br /></small><TR><td><b><big>Blastp: <a href="http://tuberculist.epfl.ch/blast_output/Rv3676.fasta.out"> Pre-computed results</a></big></b><TR><td><b><big>TransMembrane prediction using Hidden Markov Models: <a href="http://tuberculist.epfl.ch/tmhmm/Rv3676.html"> TMHMM</a></big></b><base target="_blank"/><TR><td><b><big>Genomic sequence</big></b><br /><br /><form action="dnaseq.php" method="get">"""

print l

I have one line of HTML and want to extract the text embedded between its tags. I have tried the methods I could find (the imports above show what I experimented with), but none of them gives the result I need.

How can I do it?

Expected output should be:

H37Rv|Rv3676|crp VDEILARAGIFQGVEPSAIAALTKQLQPVDFPRGHTVFAEGEPGDRLYIIISGKVKIGRRAPDGRENLLTIMGPSDMFGELSIFDPGPRTSSATTITEVRAVSMDRDALRSWIADRPEISEQLLRVLARRLRRTNNNLADLIFTDVPGRVAKQLLQLAQRFGTQEGGALRVTHDLTQEEIAQLVGASRETVNKALADFAHRGWIRLEGKSVLISDSERLARRAR

You will find very good examples of what you are trying to do here: crummy.com/software/BeautifulSoup/bs4/doc
– dima, Commented Oct 4, 2016 at 8:04

Acepcs · Accepted Answer · 2016-10-04 07:48:21Z

1

I notice you already import BeautifulSoup, so you can use it to extract this information.

soup = BeautifulSoup(l,"html.parser")
print soup.get_text()

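I tried this and it works, but get_text() returns the text of every tag in the line, so the link labels and headings after the sequence are extracted as well. You will have to cut the result if you only need the sequence.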
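The cutting step mentioned above could look like the following minimal sketch. It assumes the unwanted text always starts at the "Blastp:" label, as it does in the line from the question. The helper name `cut_before` and the shortened sample string are illustrative only, and the code is written for Python 3.

```python
def cut_before(text, marker):
    """Return the part of text that precedes the first occurrence of marker."""
    # str.partition returns (head, separator, tail); head is the whole
    # string when the marker is absent, so nothing is lost in that case.
    head, _, _ = text.partition(marker)
    return head.strip()


# Shortened stand-in for the string returned by soup.get_text();
# the real sequence is much longer.
text = " >M. tuberculosis H37Rv|Rv3676|crpVDEILARAGBlastp:  Pre-computed results"

print(cut_before(text, "Blastp:"))
```

Note that the header and the sequence run together here because get_text() joins the pieces with no separator by default. Passing a separator, for example soup.get_text("\n"), should keep them on separate lines.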


Nyakiba · Accepted Answer · 2016-10-04 08:04:12Z

1

from bs4 import BeautifulSoup

# find() returns the first matching tag. find_all() returns a list-like
# ResultSet, which has no get_text() method.
html = BeautifulSoup(l, "html.parser")
small = html.find('small')
print(small.get_text())

This finds the first small tag and prints all the text inside it, so the text from the tags that follow is not included.
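If BeautifulSoup is not available, the same idea (keep only the text inside the small tag) can be sketched with the standard library's HTMLParser. This is an illustrative alternative, not part of the answer above: the class `SmallTextExtractor`, the function `extract_small_text` and the shortened sample line are hypothetical names and data. The code targets Python 3; in Python 2 the module is called HTMLParser.

```python
from html.parser import HTMLParser  # Python 3; "HTMLParser" module in Python 2


class SmallTextExtractor(HTMLParser):
    """Collect the text found inside <small> elements (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # greater than 0 while inside a <small> element
        self.chunks = []  # text pieces in document order

    def handle_starttag(self, tag, attrs):
        if tag == "small":
            self.depth += 1
        elif tag == "br" and self.depth:
            # Treat <br /> inside <small> as a line break.
            self.chunks.append("\n")

    def handle_endtag(self, tag):
        if tag == "small" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def extract_small_text(html):
    """Return the text inside <small>, one line per <br />-separated row."""
    parser = SmallTextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks).strip()


# Shortened stand-in for the line in the question.
sample = ('<TR><TD><small style=font-family:courier> >M. tuberculosis H37Rv|Rv3676|crp'
          '<br />VDEILARAG<br />APDGRENLL<br /></small>'
          '<TR><td><b><big>Blastp: <a href="http://tuberculist.epfl.ch/blast_output/Rv3676.fasta.out">'
          ' Pre-computed results</a></big></b>')

lines = extract_small_text(sample).splitlines()
print(lines[0])            # header line
print("".join(lines[1:]))  # sequence rows joined into one string
```

Keeping the line breaks makes it easy to separate the header from the sequence rows afterwards.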


user2935002 · Accepted Answer · 2016-10-04 09:48:23Z

I tried BeautifulSoup, but it did not work for me because it produced unformatted output. I decided to write the code myself; it works fine and produces exactly what I want.

import urllib2


proxy = urllib2.ProxyHandler({'http': 'http://******************'})
opener = urllib2.build_opener(proxy)
urllib2.install_opener(opener)
res = urllib2.urlopen('http://tuberculist.epfl.ch/quicksearch.php?gene+name=Rv3676')
html = res.readlines()

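for l in html:
    # The sequence block sits on the same line as the "Genomic sequence" label.
    if "Genomic sequence" in l:
        # Keep only the part before the closing </small> tag.
        l = l.split("</small>")[0]
        # <br /> separates the header from the sequence rows.
        l = l.split("<br />")
        header = l[0]
        sequence = l[1:]
        # The header text is the fifth '>'-separated field of the first chunk.
        print "".join([">", header.split(">")[4]])
        print "".join(sequence)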

Output:

>M. tuberculosis H37Rv|Rv3676|crp
VDEILARAGIFQGVEPSAIAALTKQLQPVDFPRGHTVFAEGEPGDRLYIIISGKVKIGRRAPDGRENLLTIMGPSDMFGELSIFDPGPRTSSATTITEVRAVSMDRDALRSWIADRPEISEQLLRVLARRLRRTNNNLADLIFTDVPGRVAKQLLQLAQRFGTQEGGALRVTHDLTQEEIAQLVGASRETVNKALADFAHRGWIRLEGKSVLISDSERLARRAR
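The splitting logic above can also be wrapped in a function so it can be tried without the network request. The following is a minimal sketch for Python 3. The function name `split_fasta_line` and the shortened sample line are illustrative assumptions, and the header is taken after the last '>' rather than from a fixed index, which is a small deviation from the code above.

```python
def split_fasta_line(line):
    """Return (header, sequence) from one HTML line shaped like the question's."""
    # Everything of interest precedes the closing </small> tag.
    body = line.split("</small>")[0]
    # <br /> separates the header from the sequence rows.
    parts = body.split("<br />")
    # The header text follows the last '>' of the first chunk; rpartition
    # avoids depending on how many tags precede it.
    header = parts[0].rpartition(">")[2].strip()
    sequence = "".join(p.strip() for p in parts[1:])
    return header, sequence


# Shortened stand-in for the line that contains "Genomic sequence".
sample = ('<TR><TD><small style=font-family:courier> >M. tuberculosis H37Rv|Rv3676|crp'
          '<br />VDEILARAG<br />APDGRENLL<br /></small>'
          '<TR><td><b><big>Genomic sequence</big></b>')

header, sequence = split_fasta_line(sample)
print(">" + header)
print(sequence)
```

Like the original loop, this relies on the exact tag layout of the page, so it will need adjusting if the markup changes.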
