
I am trying to build a parser and save the results as an XML file, but I am running into problems.

Would you experts please have a look at my code?

Traceback: TypeError: expected string or buffer

import urllib2, re
from xml.dom.minidom import Document
from BeautifulSoup import BeautifulSoup as bs

osc = open('OSCTEST.html', 'r')
oscread = osc.read()
soup = bs(oscread)

doc = Document()
root = doc.createElement('root')
doc.appendChild(root)
countries = doc.createElement('countries')
root.appendChild(countries)

findtags1 = re.compile('<h1 class="title metadata_title content_perceived_text(.*?)`</h1>', re.DOTALL | re.IGNORECASE).findall(soup)
findtags2 = re.compile('<span class="content_text">(.*?)</span>', re.DOTALL | re.IGNORECASE).findall(soup)

for header in findtags1:
    title_elem = doc.createElement('title')
    countries.appendChild(title_elem)
    header_elem = doc.createTextNode(header)
    title_elem.appendChild(header_elem)

for item in findtags2:
    art_elem = doc.createElement('artikel')
    countries.appendChild(art_elem)
    s = item.replace('<P>', '')
    t = s.replace('</P>', '')
    text_elem = doc.createTextNode(t)
    art_elem.appendChild(text_elem)

print doc.toprettyxml()
  • Hi Peter; welcome to SO. Highlight code and press ctrl-k to have it properly formatted. I tried to remove some of the whitespace while hopefully preserving your code. If I've made any mistake please rollback. Commented May 22, 2010 at 10:57
  • Also, please post the full traceback if you can; it will show the line where the error occurs. Thanks. Commented May 22, 2010 at 10:59
  • I'm guessing the error is here: re.compile('....').findall(soup) Commented May 22, 2010 at 11:00
  • You're probably right, Mark. But why should we have to guess when the OP can and should learn to use the provided debugging tools? Commented May 22, 2010 at 11:02
  • Sorry for being new at this. I have tried to fix the post. Apparently I can't run a regex on soup'ed data. Commented May 22, 2010 at 11:12

1 Answer

It's good that you're trying to use BeautifulSoup to parse HTML, but this won't work:

re.compile('<h1 class="title metadata_title content_perceived_text(.*?)`</h1>',
           re.DOTALL | re.IGNORECASE).findall(soup)

You're trying to parse a BeautifulSoup object using a regular expression. Instead you should be using the findAll method on the soup, like this:

regex = re.compile('^title metadata_title content_perceived_text', re.IGNORECASE)
for tag in soup.findAll('h1', attrs = { 'class' : regex }):
    print tag.contents

If you do actually want to parse the document as text with a regular expression then don't use BeautifulSoup - just read the document into a string and parse that. But I'd suggest you take the time to learn how BeautifulSoup works as this is the preferred way to do it. See the documentation for more details.
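For completeness, here is a minimal sketch of that string-only route, using just `re` and `xml.dom.minidom` from the standard library. The sample HTML is a made-up stand-in for OSCTEST.html, `build_xml` is a hypothetical helper name, and the title pattern is slightly adjusted from the question's (it skips the rest of the opening tag so that only the heading text is captured), so treat it as an illustration under those assumptions rather than a drop-in fix.

```python
import re
from xml.dom.minidom import Document

# Hypothetical stand-in for the contents of OSCTEST.html (assumption).
html = '''
<h1 class="title metadata_title content_perceived_text">Denmark</h1>
<span class="content_text"><P>First article.</P></span>
<span class="content_text"><P>Second article.</P></span>
'''

# The patterns are applied to a plain string, not to a BeautifulSoup object.
title_re = re.compile(
    r'<h1 class="title metadata_title content_perceived_text[^>]*>(.*?)</h1>',
    re.DOTALL | re.IGNORECASE)
text_re = re.compile(
    r'<span class="content_text">(.*?)</span>',
    re.DOTALL | re.IGNORECASE)


def build_xml(source):
    """Build root/countries with one <title> per heading and one <artikel> per span."""
    doc = Document()
    root = doc.createElement('root')
    doc.appendChild(root)
    countries = doc.createElement('countries')
    root.appendChild(countries)

    for header in title_re.findall(source):
        title_elem = doc.createElement('title')
        title_elem.appendChild(doc.createTextNode(header.strip()))
        countries.appendChild(title_elem)

    for item in text_re.findall(source):
        art_elem = doc.createElement('artikel')
        # Strip the literal <P> markers, as in the question's code.
        text = item.replace('<P>', '').replace('</P>', '')
        art_elem.appendChild(doc.createTextNode(text.strip()))
        countries.appendChild(art_elem)

    return doc


doc = build_xml(html)
print(doc.toprettyxml())
```

The string returned by `doc.toprettyxml()` can then be written to a file if you want to save the result. Regular expressions are brittle against real-world HTML, which is why the `findAll` approach is still the one I'd recommend.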


8 Comments

Ah yes, but it won't find the rest. I have real problems getting BS to find the contents within the tags.
@Peter Nielsen: Can you explain what you mean by 'it won't find the rest'? Does my update answer your question?
@Peter Nielsen: "how i find the contents inside the tags". Try this: for tag in soup.findAll('h1'): print tag.contents
Uhhhhh, very, very, very nice. I just got tingly all over. ;-) Thank you very much.
@Peter, since you like the answer you should upvote and accept it -- this is really fundamental SO etiquette!