Finding a random sentence in HTML with python regex

Question

I'm trying to write a small function for another script that pulls the generated text from "http://subfusion.net/cgi-bin/quote.pl?quote=humorists&number=1".
Essentially, I need it to pull whatever sentence is between the <br> tags.

I've been trying my darndest using regular expressions, but I never really could get the hang of those.
All of the searching I did turned up things for pulling either specific sentences or single words.
This, however, needs to pull whatever arbitrary string is between the tags.

Can anyone help me out? Thanks.

Best I could come up with:

html = urlopen("http://subfusion.net/cgi-bin/quote.pl?quote=humorists&number=1").read()
output = re.findall('\<br>.*\<br>', html)

EDIT: Ended up going with a different approach altogether: simply splitting the HTML into a list separated by <br> and pulling [3], which made for cleaner code and fewer string operations. Keeping this question up for future reference and for other people with similar questions.

Beautiful soup is now my new favourite import, thank you @Floris
– Rutger Semp, Commented Jun 10, 2013 at 18:52
I am glad to hear it. It really is spectacularly good, isn't it? But what a crazy name...
– Floris, Commented Jun 10, 2013 at 19:28

Explosion Pills · Accepted Answer · 2013-04-27 00:07:46Z

1

You need to use the DOTALL flag (re.S), as there are newlines in the text that you need to match, and by default the dot does not match a newline. I would use:

re.findall('<br>(.*?)<br>', html, re.S)

However, this will return multiple results, as there are a bunch of <br> tags on that page. You may want to use the more specific:

re.findall('<hr><br>(.*?)<br><hr>', html, re.S)
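To see the difference the flag and the tighter pattern make, here is a small sketch run against a made-up string rather than the live page. The `sample_html` layout is an assumption pieced together from the patterns above, not a copy of the real page source.

```python
import re

# Hypothetical stand-in for the page source: the quote spans several lines
# and sits between <hr><br> and <br><hr>, with other <br> tags around it.
sample_html = (
    "<h1>Quotes</h1><br><br>\n"
    "<hr><br>\n"
    "A black cat crossing your path signifies that the animal is going somewhere.\n"
    "\t\t-- Groucho Marx\n\n"
    "<br><hr>\n"
    "<br>footer<br>"
)

# Without re.S, "." stops at newlines, so the multi-line quote is never captured.
without_flag = re.findall(r"<br>(.*?)<br>", sample_html)

# With re.S (re.DOTALL), "." also matches newlines, but every <br> pair matches.
loose = re.findall(r"<br>(.*?)<br>", sample_html, re.S)

# Anchoring on the surrounding <hr> tags narrows the result to the quote alone.
specific = re.findall(r"<hr><br>(.*?)<br><hr>", sample_html, re.S)

print(without_flag)
print(len(loose))
print(specific[0].strip())
```

If the real page uses different markup around the quote (for example `<br />` or extra whitespace), the specific pattern would need adjusting.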

score 1 · Accepted Answer · 2013-04-27 00:14:47Z

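from urllib.request import urlopen  # Python 3; on Python 2 use: from urllib import urlopen
import re

# read() returns bytes on Python 3, so decode before matching (the encoding is an assumption)
html = urlopen("http://subfusion.net/cgi-bin/quote.pl?quote=humorists&number=1").read().decode("utf-8", "replace")
# Capture the first run of 5 or more non-tag characters inside <body>
output = re.findall('<body>.*?>\n*([^<]{5,})<.*?</body>', html, re.S)

if output:
    print(output)
    output = re.sub('\n', ' ', output[0])  # turn newlines into spaces
    output = re.sub('\t', '', output)      # drop the tabs before the attribution
    print(output)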

Terminal

imac2011:Desktop allendar$ python test.py 
['A black cat crossing your path signifies that the animal is going somewhere.\n\t\t-- Groucho Marx\n\n']

A black cat crossing your path signifies that the animal is going somewhere. -- Groucho Marx

You could also strip off the final \n's and replace all those inside the text (on longer quotes) with <br> if you are displaying it in HTML again, so you maintain the original line breaks visually.
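A rough sketch of that last idea, using the raw match from the terminal output above as input. The helper name `to_html_lines` is made up for illustration.

```python
# Raw match as shown in the terminal output above
raw = (
    "A black cat crossing your path signifies that the animal is going somewhere."
    "\n\t\t-- Groucho Marx\n\n"
)

def to_html_lines(quote):
    """Hypothetical helper: keep the line breaks visible when re-rendering as HTML."""
    # Drop the trailing newlines and the indentation tabs
    trimmed = quote.rstrip("\n").replace("\t", "")
    # Turn the remaining inner newlines into <br> tags
    return trimmed.replace("\n", "<br>")

print(to_html_lines(raw))
```

If the quote text may contain characters such as `&` or `<`, you would probably also want to escape it (for instance with `html.escape`) before inserting the `<br>` tags.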

Casimir et Hippolyte · Accepted Answer · 2013-04-27 00:51:24Z

0

All the quotes on that page follow the same pattern, with no ambiguity, so you can use this:

output = re.findall('(?<=<br>\s)[^<]+(?=\s{2}<br)', html)

No need to use the DOTALL flag, because the pattern contains no dot.
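For anyone who wants to try this without hitting the site, here is a minimal sketch on a made-up string. The `sample_html` value is an assumption about how the markup around the quote looks (a `<br>` plus one whitespace character before it, two whitespace characters and a `<br>` after it), not the real page source.

```python
import re

# Hypothetical stand-in for the relevant part of the page
sample_html = (
    "<hr><br>\n"
    "A black cat crossing your path signifies that the animal is going somewhere.\n"
    "\t\t-- Groucho Marx\n\n"
    "<br><hr>"
)

# The lookbehind requires "<br>" plus one whitespace character before the match,
# the lookahead requires two whitespace characters and "<br" after it.
# [^<]+ matches newlines too, which is why no DOTALL flag is needed.
pattern = re.compile(r"(?<=<br>\s)[^<]+(?=\s{2}<br)")

quotes = pattern.findall(sample_html)

# Collapse the inner newline and tabs into single spaces for display
cleaned = " ".join(quotes[0].split())
print(cleaned)
```

Note that Python's `re` only accepts fixed-width lookbehinds; `<br>\s` qualifies because it always spans exactly five characters.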

Rutger Semp · Accepted Answer · 2019-11-18 10:27:50Z

0

This is uh, 7 years later, but for future reference:

Use the Beautiful Soup library for this kind of purpose, as suggested by Floris in the comments.
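A minimal sketch of what that could look like, assuming the `beautifulsoup4` package is installed and that the quote is the text node right after the first `<br>` tag. Both the `SAMPLE_HTML` layout and the `extract_quote` helper are made up for illustration; the real page may be structured differently.

```python
from bs4 import BeautifulSoup, NavigableString

# Hypothetical stand-in for the page source
SAMPLE_HTML = (
    "<html><body><hr><br>\n"
    "A black cat crossing your path signifies that the animal is going somewhere.\n"
    "\t\t-- Groucho Marx\n\n"
    "<br><hr></body></html>"
)

def extract_quote(html):
    """Return the text that follows the first <br> tag, or None if there is none."""
    soup = BeautifulSoup(html, "html.parser")
    first_br = soup.find("br")
    if first_br is None:
        return None
    node = first_br.next_sibling
    # Only accept a plain text node, not another tag
    if not isinstance(node, NavigableString):
        return None
    # Collapse newlines and tabs into single spaces
    return " ".join(str(node).split())

print(extract_quote(SAMPLE_HTML))
```

To run it against the live page, you would pass the downloaded HTML to `extract_quote` instead of `SAMPLE_HTML`.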
