xpath for parsing in Python

Question

I've written a simple parser in Python for this website. Below is part of my code.
My questions are:

How can I extract not only p[1] but also the rest: p[2], p[3], ...?
How can I separate them from each other?

text1 = xmldata.xpath('//p[@class="MsoNormal"][1]//text()')
a = ''
for i in text1:
    a = a + i.encode('cp1251')
print a

Can you share more of your code? What package are you using? lxml?
– paul trmbrth, Commented Oct 8, 2013 at 11:47
Here is the beginning of my code:
import urllib
import lxml.html
page1 = urllib.urlopen('toponymic-dictionary.in.ua/…)
pageWritten = page1.read()
pageReady = pageWritten.decode('utf-8')
xmldata = lxml.html.document_fromstring(pageReady)
– Khrystyna Pyurkovska, Commented Oct 8, 2013 at 18:18

Charles Duffy · Accepted Answer · 2013-10-08 21:45:49Z

2

Simply remove the [1] to stop filtering, and your return value will be a list, which you can pass to ''.join() to concatenate (or '\n'.join() if you want newlines between each string).

text_sections = xmldata.xpath('//p[@class="MsoNormal"]//text()')
print u'\n'.join(text_sections).encode('cp1251')
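One caveat worth checking: `//text()` returns one string per text node, so a paragraph that contains inline markup (bold, links, and so on) yields several strings, and joining them with newlines would also break inside that paragraph. A minimal sketch of the per-paragraph alternative follows. It is written in Python 3 and uses the standard library's `xml.etree.ElementTree` as a stand-in for lxml so that it runs without extra packages; the sample markup and the `paragraph_texts` helper are hypothetical. With lxml the same idea would be to select the elements with `xmldata.xpath('//p[@class="MsoNormal"]')` and call `text_content()` on each one.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample markup standing in for the real page.
SAMPLE = (
    '<div>'
    '<p class="MsoNormal">First <b>bold</b> part</p>'
    '<p class="MsoNormal">Second paragraph</p>'
    '<p class="Other">Ignored</p>'
    '</div>'
)


def paragraph_texts(root, css_class="MsoNormal"):
    """Return one string per matching <p>, with nested text merged."""
    # Select the paragraph elements themselves, not their text nodes.
    paragraphs = root.findall(".//p[@class='%s']" % css_class)
    # itertext() walks every text node inside a single paragraph.
    return ["".join(p.itertext()) for p in paragraphs]


root = ET.fromstring(SAMPLE)
texts = paragraph_texts(root)

# Exactly one newline between paragraphs, none inside them.
joined = "\n".join(texts)
print(joined)
```

Each item of `texts` corresponds to one paragraph, so `texts[0]`, `texts[1]`, ... play the role of p[1], p[2], ... from the question.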

edited Oct 8, 2013 at 21:45

answered Oct 8, 2013 at 12:22


2 Comments

Khrystyna Pyurkovska Over a year ago

Actually, I want newlines between each paragraph. In the result, I want each of p[1], p[2], ... to be separated from the others.

Charles Duffy Over a year ago

@KhrystynaPyurkovska, that's what the code I provide above will do.

paul trmbrth · Accepted Answer · 2013-10-08 20:57:14Z

You can use the lxml.html.parse() function, which accepts file-like objects such as the one urllib.urlopen() returns. See the lxml documentation on that.

Then, as @CharlesDuffy suggests, you can use u'\n'.join() to join all text nodes within the selected p elements, with a newline (\n) between them.

Also, I would suggest working with Unicode strings throughout, and encoding only when you need to print or write to a file.

import urllib
import lxml.html

page = urllib.urlopen('http://www.toponymic-dictionary.in.ua/index.php?option=com_content&view=section&layout=blog&id=1&Itemid=2')

# use "page" as a file-like object
xmldata = lxml.html.parse(page).getroot()

ptexts = xmldata.xpath('//p[@class="MsoNormal"]//text()')
joined_text = u'\n'.join(ptexts)

print joined_text.encode('cp1251')
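To illustrate the advice about keeping text as Unicode until output, here is a minimal sketch in Python 3 syntax (where `str` is already Unicode, unlike in the Python 2 code above). The sample strings and the in-memory buffer are hypothetical stand-ins for the scraped text and the output file.

```python
import io

# Hypothetical stand-in for the list returned by xpath('...//text()').
ptexts = ["Київ", "Львів"]

# All processing happens on Unicode strings.
joined_text = "\n".join(ptexts)

# Encode exactly once, at the output boundary.
encoded = joined_text.encode("cp1251")

# io.BytesIO stands in for a file opened in binary mode.
out = io.BytesIO()
out.write(encoded)

# Decoding the written bytes gives back the original text.
round_trip = out.getvalue().decode("cp1251")
print(round_trip == joined_text)
```

Encoding at a single point keeps the choice of cp1251 in one place, so switching the output encoding later would be a one-line change.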

sukhmel · Accepted Answer · 2013-10-08 12:11:01Z

0

Without knowing more of the background, I can only suggest something like this:

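texts = []
index = 0
while True:
    index += 1
    # xpath() returns an empty list when nothing matches; it does not raise,
    # so test the result instead of relying on an exception to end the loop
    temp = xmldata.xpath('//p[@class="MsoNormal"][%i]//text()' % index)
    if not temp:
        break
    texts.append(temp)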

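After this block of code, texts holds one list per index, each of the same kind as your text1. Note that in XPath the [n] in //p[...][n] counts position among the matching p children of each parent element, not across the whole document.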

answered Oct 8, 2013 at 12:11

