1

I've written a simple parser in Python for this website. Below is part of my code.
My questions are:

  1. How can I extract not only p[1] but also the rest: p[2], p[3], ...?
  2. How can I separate them from each other?

text1 = xmldata.xpath('//p[@class="MsoNormal"][1]//text()')
a = ''
for i in text1:
    a = a + i.encode('cp1251')
print a
  • Can you share more of your code? What package are you using? lxml? Commented Oct 8, 2013 at 11:47
  • Here is the beginning of my code: import urllib; import lxml.html; page1 = urllib.urlopen('toponymic-dictionary.in.ua/…); pageWritten = page1.read(); pageReady = pageWritten.decode('utf-8'); xmldata = lxml.html.document_fromstring(pageReady) Commented Oct 8, 2013 at 18:18

3 Answers

2

Simply remove the [1] predicate so the XPath is no longer restricted to the first paragraph. The return value is a list of strings, which you can pass to ''.join() to concatenate (or '\n'.join() if you want a newline between each string).

text_sections = xmldata.xpath('//p[@class="MsoNormal"]//text()')
print u'\n'.join(text_sections).encode('cp1251')

2 Comments

Actually, I want newlines between each paragraph. In the result, I want each of p[1], p[2], ... to be separated from the others.
@KhrystynaPyurkovska, that's what the code I provide above will do.
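
Joining the result of //text() puts a newline between every text node, which matches paragraph boundaries only when each paragraph contains a single text node. If a paragraph holds nested markup, one possible approach is to select the p elements first and join the text inside each one separately. The following is a minimal sketch under that assumption; the sample markup is hypothetical and only stands in for the real page.

```python
import lxml.html

# Hypothetical sample markup standing in for the real page
sample_html = u"""
<html><body>
  <p class="MsoNormal">First <b>paragraph</b>, part one.</p>
  <p class="MsoNormal">Second paragraph.</p>
  <p class="Other">Ignored.</p>
</body></html>
"""

xmldata = lxml.html.document_fromstring(sample_html)

# Select the paragraph elements themselves, not their text nodes
paragraphs = xmldata.xpath('//p[@class="MsoNormal"]')

# Join the text nodes inside each paragraph: one string per paragraph
paragraph_texts = [u''.join(p.xpath('.//text()')) for p in paragraphs]

# One paragraph per line
joined_text = u'\n'.join(paragraph_texts)
print(joined_text)
```

Here paragraph_texts[0] corresponds to p[1], paragraph_texts[1] to p[2], and so on, so each paragraph can also be processed on its own.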
1

You can use the lxml.html.parse() function, which accepts file-like objects such as the one urllib.urlopen() returns. See the lxml documentation on that.

Then, as @CharlesDuffy suggests, you can use u'\n'.join() to concatenate all text nodes within the p elements you select, separated by newlines.

Also, I would suggest working with Unicode strings throughout, and encoding only when you need to print or write to a file.

import urllib
import lxml.html

page = urllib.urlopen('http://www.toponymic-dictionary.in.ua/index.php?option=com_content&view=section&layout=blog&id=1&Itemid=2')

# use "page" as a file-like object
xmldata = lxml.html.parse(page).getroot()

ptexts = xmldata.xpath('//p[@class="MsoNormal"]//text()')
joined_text = u'\n'.join(ptexts)

print joined_text.encode('cp1251')
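
The advice about staying in Unicode until output can also be applied when writing to a file. The sketch below assumes the text nodes have already been extracted; the sample strings and the output file name are hypothetical. It uses io.open so that encoding happens only at the moment of writing.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile

# Hypothetical text nodes, standing in for the result of the xpath() call
ptexts = [u'Київ', u'Львів', u'Одеса']

# Stay in Unicode while processing
joined_text = u'\n'.join(ptexts)

# Encode only at the output boundary: io.open encodes on write
out_path = os.path.join(tempfile.gettempdir(), 'toponyms_cp1251.txt')  # hypothetical name
with io.open(out_path, 'w', encoding='cp1251') as out_file:
    out_file.write(joined_text)

# Reading the raw bytes back shows that the file content is cp1251-encoded
with io.open(out_path, 'rb') as raw_file:
    raw_bytes = raw_file.read()
```

Keeping a single encoding step at the boundary avoids mixing byte strings and Unicode strings during processing, which is a common source of UnicodeDecodeError in Python 2.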


0

Without knowing more of the background, I can only suggest something like this:

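texts = []
index = 0
while True:
    index += 1
    # xpath() returns an empty list rather than raising when nothing matches,
    # so stop at the first index that selects no paragraph
    if not xmldata.xpath('//p[@class="MsoNormal"][%i]' % index):
        break
    temp = xmldata.xpath('//p[@class="MsoNormal"][%i]//text()' % index)
    texts.append(temp)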

After this block runs, texts is a list in which each item has the same form as your text1: the list of text nodes selected for one value of the index.
