2

I'm trying to write a small function for another script that pulls the generated text from "http://subfusion.net/cgi-bin/quote.pl?quote=humorists&number=1".
Essentially, I need it to pull whatever sentence is between <br> tags.

I've been trying my darndest using regular expressions, but I never really could get the hang of those.
All of the searching I did turned up things for pulling either specific sentences, or single words.
This, however, needs to pull whatever arbitrary string is between <br> tags.

Can anyone help me out? Thanks.

Best I could come up with:

html = urlopen("http://subfusion.net/cgi-bin/quote.pl?quote=humorists&number=1").read()
output = re.findall('\<br>.*\<br>', html)

EDIT: Ended up going with a different approach altogether: simply splitting the HTML into a list separated by <br> and pulling [3], which made for cleaner code and fewer string operations. Keeping this question up for future reference and for other people with similar questions.
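
For future readers, the split approach from the edit might look roughly like the sketch below. The `sample_html` string and the `quote_by_split` helper are hypothetical stand-ins, not the real page or the asker's actual code; on the live page the quote sat at index 3, while in this shortened sample it sits at index 1.

```python
# Hypothetical stand-in for the page's HTML; the real markup may differ.
sample_html = (
    "<html><body><hr><br>\n"
    "A black cat crossing your path signifies that the animal is going somewhere.\n"
    "\t\t-- Groucho Marx\n\n"
    "<br><hr></body></html>"
)


def quote_by_split(html, index):
    """Split the HTML on <br> and return the chunk at `index`, whitespace-normalised."""
    parts = html.split("<br>")
    # str.split() with no arguments collapses newlines, tabs and repeated spaces.
    return " ".join(parts[index].split())


# Index 1 fits this sample only; the asker used index 3 on the real page.
quote = quote_by_split(sample_html, 1)
print(quote)
```

The index is tied to the page layout, so this breaks as soon as the site adds or removes a <br> before the quote.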

3
  • Google "beautiful soup" and you will be enlightened... Commented Apr 27, 2013 at 3:29
  • 1
    Beautiful soup is now my new favourite import, thank you @Floris Commented Jun 10, 2013 at 18:52
  • 1
    I am glad to hear it. It really is spectacularly good, isn't it. But what a crazy name... Commented Jun 10, 2013 at 19:28

4 Answers

1

You need to use the DOTALL flag, as there are newlines in the text that you need to match. I would use

re.findall('<br>(.*?)<br>', html, re.S)

However, this will return multiple results, as there are a bunch of <br><br> sequences on that page. You may want to use the more specific:

re.findall('<hr><br>(.*?)<br><hr>', html, re.S)
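
To see what the flag changes, here is a small sketch run against a hypothetical `sample_html` string (an assumption standing in for the real page, which may be laid out differently). Without `re.S` the dot cannot cross the newlines inside the quote, so nothing matches.

```python
import re

# Hypothetical stand-in for the page's HTML; the real markup may differ.
sample_html = (
    "<html><body><hr><br>\n"
    "A black cat crossing your path signifies that the animal is going somewhere.\n"
    "\t\t-- Groucho Marx\n\n"
    "<br><hr></body></html>"
)

pattern = r"<hr><br>(.*?)<br><hr>"

# Without re.S, "." does not match "\n", so a multi-line quote is missed.
no_flag = re.findall(pattern, sample_html)

# With re.S (DOTALL), "." also matches newlines and the quote is captured.
with_flag = re.findall(pattern, sample_html, re.S)

print(no_flag)
print(with_flag)
```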

1
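from urllib import urlopen  # Python 2; on Python 3 use: from urllib.request import urlopen
import re

html = urlopen("http://subfusion.net/cgi-bin/quote.pl?quote=humorists&number=1").read()  # on Python 3, add .decode() to get a str
# Capture the first run of 5+ non-tag characters inside <body>; re.S lets "." span newlines.
output = re.findall('<body>.*?>\n*([^<]{5,})<.*?</body>', html, re.S)

if output:
    print(output)
    # Put the quote on one line: newlines become spaces, tabs are dropped.
    output = re.sub('\n', ' ', output[0])
    output = re.sub('\t', '', output)
    print(output)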

Terminal

imac2011:Desktop allendar$ python test.py 
['A black cat crossing your path signifies that the animal is going somewhere.\n\t\t-- Groucho Marx\n\n']

A black cat crossing your path signifies that the animal is going somewhere. -- Groucho Marx

You could also strip off the final \n's and replace those inside the text (on longer quotes) with <br /> if you are displaying it in HTML again, so that the original line breaks are preserved visually.
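
A minimal sketch of that idea, using the captured string from the terminal output above as input. The helper name `to_html_lines` is hypothetical, and stripping the leading tabs from each line is an extra assumption about what you would want when redisplaying the quote.

```python
raw = (
    "A black cat crossing your path signifies that the animal is going somewhere."
    "\n\t\t-- Groucho Marx\n\n"
)


def to_html_lines(text):
    """Drop trailing newlines and join the remaining lines with <br />."""
    # strip() removes the final "\n\n"; splitlines() splits on the inner "\n".
    lines = [line.strip() for line in text.strip().splitlines()]
    return "<br />".join(lines)


html_quote = to_html_lines(raw)
print(html_quote)
```

If the text could contain characters such as `&` or `<`, escaping it (for example with `html.escape`) before joining would be a sensible extra step.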

0

All the quotes on that page follow the same pattern, with no ambiguity, so you can use this:

output = re.findall('(?<=<br>\s)[^<]+(?=\s{2}<br)', html)

There is no need for the DOTALL flag because the pattern contains no dot.
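
Here is a quick check of that pattern against a hypothetical `sample_html` string (an assumption; the live page may differ). The negated class `[^<]` matches newlines by itself, which is why no flag is required, and the lookahead `\s{2}` leaves the two trailing newlines out of the match.

```python
import re

# Hypothetical stand-in for the page's HTML; the real markup may differ.
sample_html = (
    "<html><body><hr><br>\n"
    "A black cat crossing your path signifies that the animal is going somewhere.\n"
    "\t\t-- Groucho Marx\n\n"
    "<br><hr></body></html>"
)

# Lookbehind: "<br>" plus one whitespace character must precede the match.
# Lookahead: two whitespace characters and "<br" must follow it.
matches = re.findall(r"(?<=<br>\s)[^<]+(?=\s{2}<br)", sample_html)
print(matches)
```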

0

This is, uh, 7 years later, but for future reference:

Use the Beautiful Soup library for this kind of task, as suggested by Floris in the comments.
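
A minimal sketch of what that could look like, assuming the `beautifulsoup4` package is installed and using a hypothetical `sample_html` string in place of the real page. On the live page there may be other text around the quote, so you would likely need to narrow the selection first.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the page's HTML; the real markup may differ.
sample_html = (
    "<html><body><hr><br>\n"
    "A black cat crossing your path signifies that the animal is going somewhere.\n"
    "\t\t-- Groucho Marx\n\n"
    "<br><hr></body></html>"
)

# "html.parser" is the parser bundled with Python, so no extra install is needed for it.
soup = BeautifulSoup(sample_html, "html.parser")

# get_text() returns all text in the document with the tags removed;
# split()/join then collapses newlines and tabs into single spaces.
quote = " ".join(soup.get_text().split())
print(quote)
```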
