Finding images in HTML source code with Python

Question

I need to find the images in some HTML source code. I'm using regex instead of html.parser because I know it better, but if you can explain HTML parsing to me the way you would to a child, I'll happily go down that road too.

I can't use BeautifulSoup. I wish I could, but I have to learn to do this the hard way.

I've read through a lot of questions and answers on here on regex and html (example) so I'm aware of the feelings on this topic.

But hear me out!

Here's my coding attempt (Python 3):

import urllib.request
import re

website = urllib.request.urlopen('http://google.com')
html = website.read()
pat = re.compile (r'<img [^>]*src="([^"]+)')
img = pat.findall(html)

I double-checked my regex on regex101.com and it finds the img link, but when I run it in IDLE I get a syntax error, and IDLE keeps highlighting the caret. Why?

I'm headed in the right direction... yes?

Update: Hi, I was hoping for a short, quick answer, but it seems I may have touched a nerve in the community.

I am definitely new and terrible at programming, no way around that. I've been reading all the comments and I really appreciate all the help and patience users have shown me.

You're getting a syntax error because... this is invalid syntax (hint: re.compile expects a string). But you should just take a look at the BeautifulSoup html parser, there are enough examples on here and elsewhere that should get you started.
– l4mpi, Commented Oct 20, 2013 at 12:31
@user2799617 The person has asked a valid question, showed us what he's tried, and checked it on regex101 (which we need a link of). I highly doubt that he has done anything wrong.
– Nafiul Islam, Commented Oct 20, 2013 at 12:35
@pythonintraining For the gz issue, I guess you're using Windows. Install a utility like 7Zip.
– nanofarad, Commented Oct 20, 2013 at 12:37
hey user2799617, you don't need to ride me, i already ride myself hard enough. i thought the point of stackoverflow was to help people like me, go to reddit or craigslist if you want to keep on ranting.
– pythonintraining, Commented Oct 20, 2013 at 12:39

Stefano Sanfilippo · Accepted Answer · 2013-10-20 13:09:18Z

3

There is nothing wrong with the regex; you are missing two things:

1. Python does not have a regex literal type, so you have to wrap the pattern in a string. Use a raw string so that the pattern is passed as-is to the regex compiler, without any escape interpretation.
2. The result of the .read() call is a byte sequence, not a string, so you need a byte sequence regex.

The second one is Python 3-specific (and I see that you are using Python 3).

Putting it all together, just fix the aforementioned line like this:

pat = re.compile(rb'<img [^>]*src="([^"]+)')

r stands for raw and b for byte sequence.
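To make the str/bytes mismatch concrete, here is a minimal sketch that runs without a network connection. The html_bytes value is a made-up stand-in for what website.read() returns, and decoding as UTF-8 is an assumption; a real page may declare a different encoding.

```python
import re

# Made-up stand-in for the bytes returned by website.read().
html_bytes = b'<p><img alt="logo" src="/img/logo.png"></p>'

# A str pattern applied to bytes raises TypeError in Python 3.
str_pat = re.compile(r'<img [^>]*src="([^"]+)')
try:
    str_pat.findall(html_bytes)
    mismatch_error = None
except TypeError as exc:
    mismatch_error = exc

# Option 1: keep the data as bytes and use a bytes pattern (rb prefix).
bytes_pat = re.compile(rb'<img [^>]*src="([^"]+)')
from_bytes = bytes_pat.findall(html_bytes)

# Option 2: decode to str first (UTF-8 assumed here), then use the str pattern.
from_text = str_pat.findall(html_bytes.decode('utf-8'))

print(type(mismatch_error).__name__)
print(from_bytes)
print(from_text)
```

Either option works; the bytes pattern returns bytes objects, while decoding first gives ordinary strings that are usually easier to work with afterwards.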

Additionally, test on a website that actually embeds images in <img> tags, such as http://stackoverflow.com. You will not find any when processing http://google.com.

Here we go:

Python 3.3.2+
Type "help", "copyright", "credits" or "license" for more information.
>>> import urllib.request
>>> import re
>>> website = urllib.request.urlopen('http://stackoverflow.com/')
>>> html = website.read()
>>> pat = re.compile (rb'<img [^>]*src="([^"]+)')
>>> img = pat.findall(html)
>>> img
[b'https://i.sstatic.net/tKsDb.png', b'https://i.sstatic.net/dmHl0.png', b'https://i.sstatic.net/dmHl0.png', b'https://i.sstatic.net/tKsDb.png', b'https://i.sstatic.net/6QN0y.png', b'https://i.sstatic.net/tKsDb.png', b'https://i.sstatic.net/L8rHf.png', b'https://i.sstatic.net/tKsDb.png', b'http://pixel.quantserve.com/pixel/p-c1rF4kxgLUzNc.gif']

answered Oct 20, 2013 at 13:09


1 Comment

pythonintraining Over a year ago

Thanks! I knew I was close!

Nafiul Islam · Accepted Answer · 2013-10-20 13:17:46Z

1

Instead of urllib I used requests, a third-party library that you have to install separately. Both do the same job here; I just prefer requests because it has a nicer API. The regex is only slightly changed: \s* is added to allow whitespace between the < and img. You were headed in the right direction. You can find out more about the re module in its documentation.

Here is the code

import requests
import re

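website = requests.get('http://stackoverflow.com/')
html = website.text  # .text is already decoded to a string, so a plain string pattern works
pat = re.compile(r'<\s*img [^>]*src="([^"]+)')
img = pat.findall(html)

print(img)  # the print() form works on both Python 2 and Python 3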

And the output (from Python 2, hence the u prefixes):

[u'https://i.sstatic.net/tKsDb.png', u'https://i.sstatic.net/L8rHf.png', u'https://i.sstatic.net/tKsDb.png', u'https://i.sstatic.net/Ryr18.png', u'https://i.sstatic.net/ASf0H.png', u'https://i.sstatic.net/tKsDb.png', u'https://i.sstatic.net/tKsDb.png', u'https://i.sstatic.net/tKsDb.png', u'https://i.sstatic.net/Ryr18.png', u'https://i.sstatic.net/VgvXl.png', u'https://i.sstatic.net/tKsDb.png', u'https://i.sstatic.net/tKsDb.png', u'https://i.sstatic.net/tKsDb.png', u'https://i.sstatic.net/tKsDb.png', u'https://i.sstatic.net/6QN0y.png', u'http://pixel.quantserve.com/pixel/p-c1rF4kxgLUzNc.gif']

edited Oct 20, 2013 at 13:17

answered Oct 20, 2013 at 12:56


1 Comment

Fred Mitchell Over a year ago

I will add one suggestion. This answer is good. The question would have been valid without any code to retrieve a web page. In the future, it might be worthwhile to make a function that finds what you want from a string or array of bytes. Then the function has only a single concern, finding a list of images.
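A minimal sketch of that suggestion might look like the following. The name find_image_sources is hypothetical and not from any answer here; it reuses the pattern from this answer and accepts either a string or bytes, so the page download stays a separate concern.

```python
import re

# Hypothetical helper; the pattern is the one from the answer above.
_IMG_SRC_TEXT = re.compile(r'<\s*img [^>]*src="([^"]+)')
_IMG_SRC_BYTES = re.compile(rb'<\s*img [^>]*src="([^"]+)')


def find_image_sources(html):
    """Return the src values of <img> tags found in html (str or bytes).

    The result has the same type as the input: str in, str out; bytes in, bytes out.
    """
    if isinstance(html, bytes):
        return _IMG_SRC_BYTES.findall(html)
    return _IMG_SRC_TEXT.findall(html)


# Usage with made-up sample markup, no network access needed.
sample = '<div><img class="a" src="one.png"><img src="two.gif"></div>'
print(find_image_sources(sample))
print(find_image_sources(sample.encode('utf-8')))
```

Because the function only looks at its argument, it can be tested against small literal snippets like the sample above before being pointed at a real page.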

mislavcimpersak · Accepted Answer · 2013-10-20 13:05:41Z

0

re.compile (r'<img [^>]*src="([^"]+)')

You are missing the quotation marks (single or double) around the pattern.

edited Oct 20, 2013 at 13:05

answered Oct 20, 2013 at 12:40


8 Comments

l4mpi Over a year ago

"and just to be sure it's good to escape quotation marks within the expresion" - what? That's more than wrong in this case...

pythonintraining Over a year ago

agreed, but thanks for catching the missing quotation marks. now my error reads as: TypeError: can't use a string pattern on a bytes-like object

mislavcimpersak Over a year ago

it's a general remark regarding regex. in his case of parsing html he should catch both single and double quotation marks, but that is his job to do

l4mpi Over a year ago

@mislav do you know what the r in front of the string means? "escaping" the quotation marks should only be done if they actually need to be escaped. Your regex matches \" instead of just the ".

mislavcimpersak Over a year ago

I'm changing the answer to include only the remark about the missing quotes, so as not to derail someone in the future. Worrying about quotes inside the regex for HTML is a whole other issue.
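On the quoting issue raised in these comments, and since the question also asks about the html.parser route: a rough sketch with the standard library's HTMLParser is below. The class name ImageSourceParser and the sample markup are made up for illustration. The parser handles single-quoted, double-quoted, and unquoted attribute values and lowercases tag and attribute names, so the regex does not have to.

```python
from html.parser import HTMLParser


class ImageSourceParser(HTMLParser):
    """Collect the src attribute of every <img> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # tag and attribute names arrive lowercased; attrs is a list of (name, value) pairs.
        # Self-closing tags such as <img ... /> are routed here as well by default.
        if tag == 'img':
            for name, value in attrs:
                if name == 'src' and value:
                    self.sources.append(value)


# Made-up sample covering single quotes, double quotes, and no quotes.
sample = (
    "<IMG SRC='a.png'>"
    '<img alt="x" src="b.gif" />'
    "<img src=c.jpg>"
)

parser = ImageSourceParser()
parser.feed(sample)  # feed() expects str; decode bytes from urlopen().read() first
parser.close()
print(parser.sources)
```

The idea is that the parser calls handle_starttag once per opening tag it sees, and the subclass simply remembers the src values of the img tags.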
