Python Regex - Retrieving from XML File

Question

I'm having trouble with regular expressions. I'm trying to extract the name of each item from an XML file: https://www.crimezappers.com/rss/catalog/category/cid/97/store_id/1/. I have found a pattern that works, but it is very clunky. Is there a way to make the expression shorter?

This is what I currently have (long way):

<item>\n<title>\n<!\[CDATA\[ ([A-Za-z].[^\]]+)|<item>\n<title>\n<!\[CDATA\[\n([A-Za-z].[^\]]+)

This is my attempt at doing it:

<item>\n<title>\n<!\[CDATA\[|(?\n)| |([A-Za-z].[^\]]+)

[Image: sample of the text that should be matched; the blue underline marks text that should also be matched]

For parsing XML, I suggest using the existing lxml parsing library instead of using regex directly.
– Kir Chou, Commented May 15, 2017 at 4:59

falsetru · Accepted Answer · 2017-05-15 06:28:07Z

2

Using regular expressions to parse XML is not a good idea.

Use an XML processing library like lxml:

>>> import requests
>>> import lxml.etree
>>> 
>>> r = requests.get('https://www.crimezappers.com/rss/...')
>>> root = lxml.etree.fromstring(r.content)
>>> root.xpath('//item/title/text()')
['Electrical Box HD Hidden Camera with Built in DVR',
 'Mini Clip On Smiley Face Button Spy Hidden Camera with Built in DVR',
 ...]
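
If lxml is not installed, the same approach can be sketched with the standard library's xml.etree.ElementTree. The sample feed below is hypothetical: it is not the real feed content and only mimics the structure described in the question (title text wrapped in CDATA, with either a space or a newline around it). The helper name item_titles is also made up for this illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample that imitates the feed's layout; not the real feed.
SAMPLE_RSS = """<rss version="2.0">
<channel>
<title>
<![CDATA[ Catalog ]]>
</title>
<item>
<title>
<![CDATA[ Electrical Box HD Hidden Camera with Built in DVR ]]>
</title>
</item>
<item>
<title>
<![CDATA[
Mini Clip On Smiley Face Button Spy Hidden Camera with Built in DVR
]]>
</title>
</item>
</channel>
</rss>"""


def item_titles(xml_text):
    """Return the stripped <title> text of every <item> element."""
    root = ET.fromstring(xml_text)
    # The parser unwraps CDATA itself, so only the surrounding
    # whitespace has to be stripped. The channel-level <title> is
    # skipped because only <item> elements are visited.
    return [item.findtext('title', default='').strip()
            for item in root.iter('item')]


titles = item_titles(SAMPLE_RSS)
print(titles)
```

Because the parser handles CDATA and nesting, this does not depend on how the whitespace around the title happens to be laid out.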

UPDATE: Using a regular expression.

You can use \s to match any whitespace character (including the newline character \n):

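>>> import re
>>> # r.text is the decoded str; on Python 3 a str pattern cannot search the bytes in r.content
>>> re.findall(r'<item>\s*<title>\s*<!\[CDATA\[\s*(.*?)\s*\]\]>', r.text)
['Electrical Box HD Hidden Camera with Built in DVR',
 'Mini Clip On Smiley Face Button Spy Hidden Camera with Built in DVR',
 ...]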

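Replaced [A-Za-z].[^\]]+ with (.*?)\]\]> to match everything between <![CDATA[ and ]]>. The ? after .* makes the match non-greedy, so it stops at the first ]]> instead of the last one, and the surrounding \s* keeps leading and trailing whitespace out of the captured group.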
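
To see the pattern above work without fetching the feed, here is a minimal sketch on a made-up string that contains both layouts from the question (a space after <![CDATA[ in one item, a newline in the other). The sample text and the variable names are assumptions for illustration only. The last two lines show the greedy versus non-greedy difference on a simpler one-line input.

```python
import re

# Hypothetical sample covering both layouts from the question.
SAMPLE = (
    "<item>\n<title>\n<![CDATA[ First Item ]]>\n</title>\n</item>\n"
    "<item>\n<title>\n<![CDATA[\nSecond Item\n]]>\n</title>\n</item>\n"
)

# \s* absorbs spaces and newlines alike, so one alternative is enough.
PATTERN = r'<item>\s*<title>\s*<!\[CDATA\[\s*(.*?)\s*\]\]>'

names = re.findall(PATTERN, SAMPLE)
print(names)

# Greedy .* runs to the last closing bracket; non-greedy .*? stops at the first.
greedy = re.findall(r'\[(.*)\]', '[a] [b]')
lazy = re.findall(r'\[(.*?)\]', '[a] [b]')
print(greedy, lazy)
```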

edited May 15, 2017 at 6:28

answered May 15, 2017 at 4:58

