import urllib2
import nltk
from HTMLParser import HTMLParser
from bs4 import BeautifulSoup




l = """<TR><TD><small style=font-family:courier> >M. tuberculosis H37Rv|Rv3676|crp<br />VDEILARAGIFQGVEPSAIAALTKQLQPVDFPRGHTVFAEGEPGDRLYIIISGKVKIGRR<br />APDGRENLLTIMGPSDMFGELSIFDPGPRTSSATTITEVRAVSMDRDALRSWIADRPEIS<br />EQLLRVLARRLRRTNNNLADLIFTDVPGRVAKQLLQLAQRFGTQEGGALRVTHDLTQEEI<br />AQLVGASRETVNKALADFAHRGWIRLEGKSVLISDSERLARRAR<br /></small><TR><td><b><big>Blastp: <a href="http://tuberculist.epfl.ch/blast_output/Rv3676.fasta.out"> Pre-computed results</a></big></b><TR><td><b><big>TransMembrane prediction using Hidden Markov Models: <a href="http://tuberculist.epfl.ch/tmhmm/Rv3676.html"> TMHMM</a></big></b><base target="_blank"/><TR><td><b><big>Genomic sequence</big></b><br /><br /><form action="dnaseq.php" method="get">"""

print l

I have a single line of HTML and want to extract the text embedded between its tags. I have tried several methods, but none of them works in my case.

How can I do it?

Expected output should be:

H37Rv|Rv3676|crp VDEILARAGIFQGVEPSAIAALTKQLQPVDFPRGHTVFAEGEPGDRLYIIISGKVKIGRRAPDGRENLLTIMGPSDMFGELSIFDPGPRTSSATTITEVRAVSMDRDALRSWIADRPEISEQLLRVLARRLRRTNNNLADLIFTDVPGRVAKQLLQLAQRFGTQEGGALRVTHDLTQEEIAQLVGASRETVNKALADFAHRGWIRLEGKSVLISDSERLARRAR


3 Answers


You already import BeautifulSoup, so you can use it to extract this text.

soup = BeautifulSoup(l,"html.parser")
print soup.get_text()

I tried this and it works, but the text from the tags after the sequence is extracted as well, so you will need to trim the result if you only want the sequence.
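If trimming the string afterwards is awkward, one alternative is to collect only the text that sits inside the `<small>` tag. Below is a minimal sketch using the standard-library `HTMLParser` (which the question already imports) rather than BeautifulSoup. The names `SmallTextParser` and `extract_small_text` are made up for this illustration, and the sketch assumes the record of interest is the only `<small>` element on the line.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2


class SmallTextParser(HTMLParser):
    """Hypothetical helper: collect text found inside <small>...</small> only."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.depth = 0     # number of <small> tags currently open
        self.chunks = []   # non-empty text pieces seen inside <small>

    def handle_starttag(self, tag, attrs):
        if tag == "small":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "small" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        # Ignore whitespace-only pieces and anything outside <small>
        if self.depth > 0 and data.strip():
            self.chunks.append(data.strip())


def extract_small_text(html):
    """Return the list of text pieces inside <small> tags, in document order."""
    parser = SmallTextParser()
    parser.feed(html)
    parser.close()
    return parser.chunks
```

With the string `l` from the question, `chunks = extract_small_text(l)` should give the header line as `chunks[0]`, and `"".join(chunks[1:])` should give the sequence as one string.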

from bs4 import BeautifulSoup

# find() returns the first matching tag; find_all() returns a list,
# which has no get_text() method
html = BeautifulSoup(l, "html.parser")
small = html.find('small')
print(small.get_text())

This finds the <small> tag and prints all the text inside it.


I tried BeautifulSoup, but it did not work for me because it produced unformatted output, so I decided to write the code myself. It works fine and produces what I want.

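import urllib2

# Route HTTP requests through a proxy
proxy = urllib2.ProxyHandler({'http': 'http://******************'})
opener = urllib2.build_opener(proxy)
urllib2.install_opener(opener)
res = urllib2.urlopen('http://tuberculist.epfl.ch/quicksearch.php?gene+name=Rv3676')
html = res.readlines()

for line in html:
    # The protein record is on the same line as the "Genomic sequence" form
    if "Genomic sequence" in line:
        # Keep the part before </small>, then split header and rows on <br />
        parts = line.split("</small>")[0].split("<br />")
        header = parts[0]
        sequence = parts[1:]

        # The header text follows the fourth '>' of the leading tags
        print "".join([">", header.split(">")[4]])
        print "".join(sequence)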

Output:

>M. tuberculosis H37Rv|Rv3676|crp


VDEILARAGIFQGVEPSAIAALTKQLQPVDFPRGHTVFAEGEPGDRLYIIISGKVKIGRRAPDGRENLLTIMGPSDMFGELSIFDPGPRTSSATTITEVRAVSMDRDALRSWIADRPEISEQLLRVLARRLRRTNNNLADLIFTDVPGRVAKQLLQLAQRFGTQEGGALRVTHDLTQEEIAQLVGASRETVNKALADFAHRGWIRLEGKSVLISDSERLARRAR
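The same splitting idea can be tried without any network access by wrapping it in a small function and feeding it the line directly. This is only a sketch: `fasta_from_line` is a hypothetical name, and it swaps the fixed index `split(">")[4]` for `rsplit(">", 1)`, on the assumption that the header is whatever follows the last `>` before the first `<br />`.

```python
def fasta_from_line(line):
    """Hypothetical helper: return (header, sequence) from the HTML line
    that holds the protein record."""
    # Everything of interest comes before the closing </small> tag
    record = line.split("</small>")[0]
    # The header and each sequence row are separated by <br />
    parts = record.split("<br />")
    # Assumption: the header text follows the last '>' in the first part
    header = ">" + parts[0].rsplit(">", 1)[1].strip()
    # Join the rows; the empty piece after the final <br /> adds nothing
    sequence = "".join(part.strip() for part in parts[1:])
    return header, sequence
```

Calling `fasta_from_line(l)` on the string from the question should return the FASTA header and the sequence joined into a single string, which can then be printed on two lines.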
