Python regex on list

Question

I am trying to build a parser and save the results as an XML file, but I am running into problems.

Could you please have a look at my code?

Traceback: TypeError: expected string or buffer

import urllib2, re
from xml.dom.minidom import Document
from BeautifulSoup import BeautifulSoup as bs
osc = open('OSCTEST.html','r')
oscread = osc.read()
soup=bs(oscread)
doc = Document()
root = doc.createElement('root')
doc.appendChild(root)
countries = doc.createElement('countries')
root.appendChild(countries)
findtags1 = re.compile ('<h1 class="title metadata_title content_perceived_text(.*?)`</h1>', re.DOTALL |  re.IGNORECASE).findall(soup)
findtags2 = re.compile ('<span class="content_text">(.*?)</span>', re.DOTALL |  re.IGNORECASE).findall(soup)
for header in findtags1:
    title_elem = doc.createElement('title')
    countries.appendChild(title_elem)
    header_elem = doc.createTextNode(header)
    title_elem.appendChild(header_elem)
for item in findtags2:
    art_elem = doc.createElement('artikel')
    countries.appendChild(art_elem)
    s = item.replace('<P>','')
    t = s.replace('</P>','')
    text_elem = doc.createTextNode(t)
    art_elem.appendChild(text_elem)

print doc.toprettyxml()

Hi Peter; welcome to SO. Highlight code and press ctrl-k to have it properly formatted. I tried to remove some of the whitespace while hopefully preserving your code. If I've made any mistake please rollback.
– mechanical_meat, Commented May 22, 2010 at 10:57
Also, please post the traceback if you can; it will show the line where the error occurs. Thanks.
– mechanical_meat, Commented May 22, 2010 at 10:59
I'm guessing the error is here: re.compile('....').findall(soup)
– Mark Byers, Commented May 22, 2010 at 11:00
You're probably right, Mark. But why should we have to guess when the OP can and should learn to use the provided debugging tools?
– mechanical_meat, Commented May 22, 2010 at 11:02
Sorry for being new at this. I have tried to fix the post. Apparently I can't run a regex on soup'ed data.
– Peter Nielsen, Commented May 22, 2010 at 11:12

Mark Byers · Accepted Answer · 2010-05-22 17:52:54Z

5

It's good that you're trying to use BeautifulSoup to parse HTML, but this won't work:

re.compile('<h1 class="title metadata_title content_perceived_text(.*?)`</h1>',
           re.DOTALL | re.IGNORECASE).findall(soup)

You're trying to parse a BeautifulSoup object using a regular expression. Instead you should be using the findAll method on the soup, like this:

regex = re.compile('^title metadata_title content_perceived_text', re.IGNORECASE)
for tag in soup.findAll('h1', attrs = { 'class' : regex }):
    print tag.contents
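
For context on the traceback: findall only accepts a string (or buffer), so handing it any other kind of object raises TypeError. Here is a minimal sketch of that behaviour. It is written in Python 3 syntax rather than the Python 2 of the question, and FakeSoup is a hypothetical stand-in for the parsed document (not BeautifulSoup itself) so that it runs with the standard library only:

```python
import re


class FakeSoup(object):
    """Hypothetical stand-in for a parsed-document object (not BeautifulSoup)."""

    def __init__(self, markup):
        self.markup = markup

    def __str__(self):
        return self.markup


pattern = re.compile(r'<span class="content_text">(.*?)</span>',
                     re.DOTALL | re.IGNORECASE)
soup_like = FakeSoup('<span class="content_text">Hello</span>')


def try_findall(target):
    """Return the matches, or the exception class name if findall rejects target."""
    try:
        return pattern.findall(target)
    except TypeError as exc:
        return type(exc).__name__


rejected = try_findall(soup_like)        # a non-string object is rejected
accepted = try_findall(str(soup_like))   # the same markup as a plain string works
print(rejected)
print(accepted)
```

Converting the object to a string makes the regex run, but at that point the parser is doing no useful work, which is why findAll on the soup is the better fit.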

If you do actually want to parse the document as text with a regular expression then don't use BeautifulSoup - just read the document into a string and parse that. But I'd suggest you take the time to learn how BeautifulSoup works as this is the preferred way to do it. See the documentation for more details.
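
If you go the plain-string route, a rough sketch could look like the following. It uses Python 3 syntax and only the standard library. The sample markup, the build_xml helper and the slightly adjusted title pattern are my own assumptions for illustration, not taken from your file:

```python
import re
from xml.dom.minidom import Document

# Hypothetical sample markup standing in for the contents of OSCTEST.html.
html = (
    '<h1 class="title metadata_title content_perceived_text">Denmark</h1>'
    '<span class="content_text"><P>First article</P></span>'
    '<span class="content_text"><P>Second article</P></span>'
)

# Assumed pattern: skip the rest of the opening tag, then capture the heading text.
title_re = re.compile(
    r'<h1 class="title metadata_title content_perceived_text[^>]*>(.*?)</h1>',
    re.DOTALL | re.IGNORECASE)
article_re = re.compile(r'<span class="content_text">(.*?)</span>',
                        re.DOTALL | re.IGNORECASE)


def build_xml(markup):
    """Run both regexes on a plain string and collect the matches in a DOM."""
    doc = Document()
    root = doc.createElement('root')
    doc.appendChild(root)
    countries = doc.createElement('countries')
    root.appendChild(countries)
    for header in title_re.findall(markup):
        title_elem = doc.createElement('title')
        title_elem.appendChild(doc.createTextNode(header))
        countries.appendChild(title_elem)
    for item in article_re.findall(markup):
        art_elem = doc.createElement('artikel')
        text = item.replace('<P>', '').replace('</P>', '')
        art_elem.appendChild(doc.createTextNode(text))
        countries.appendChild(art_elem)
    return doc


doc = build_xml(html)
print(doc.toprettyxml())
```

With a real file you would replace the sample string with the result of reading OSCTEST.html into a string, and pass that string (not a soup object) to build_xml.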

edited May 22, 2010 at 17:52

answered May 22, 2010 at 10:59

Mark Byers


8 Comments

Peter Nielsen Over a year ago

ah yes BUT it won't find the rest.. I have real problems getting BS to find the contents from within the tags..

Mark Byers Over a year ago

@Peter Nielsen: Can you explain what you mean by 'it won't find the rest'? Does my update answer your question?

Mark Byers Over a year ago

@Peter Nielsen: "how i find the contents inside the tags". Try this: for tag in soup.findAll('h1'): print tag.contents

Peter Nielsen Over a year ago

Uhhhhh.. very , very , very nice.. I just got tingly all over.. ;-) Ty very much..

Alex Martelli Over a year ago

@Peter, since you like the answer you should upvote and accept it -- this is really fundamental SO etiquette!
