0

I'm currently having issues with regular expressions. I'm trying to extract the name of an item from an XML file: https://www.crimezappers.com/rss/catalog/category/cid/97/store_id/1/. I have found a method, however, it is very clunky, I was wondering if there was a way to make the expression smaller?

This is what I currently have (long way):

<item>\n<title>\n<!\[CDATA\[ ([A-Za-z].[^\]]+)|<item>\n<title>\n<!\[CDATA\[\n([A-Za-z].[^\]]+)

This is my attempt at doing it:

<item>\n<title>\n<!\[CDATA\[|(?\n)| |([A-Za-z].[^\]]+)

Image of what should be found, the blue underline is what should be also found

1
  • 1
    For parsing xml, I suggest to use exist lxml parsing library, instead of using regex directly. Commented May 15, 2017 at 4:59

1 Answer 1

2

Using regular expression to parse xml is not a good idea.

Use xml processing library like lxml:

>>> import requests
>>> import lxml.etree
>>> 
>>> r = requests.get('https://www.crimezappers.com/rss/...')
>>> root = lxml.etree.fromstring(r.content)
>>> root.xpath('//item/title/text()')
['Electrical Box HD Hidden Camera with Built in DVR',
 'Mini Clip On Smiley Face Button Spy Hidden Camera with Built in DVR',
 ...]

UPDATE Using regular expression.

You can use \s to match any space characters (including newline character \n):

>>> re.findall(r'<item>\s*<title>\s*<!\[CDATA\[\s*(.*?)\s*\]\]>', r.content)
['Electrical Box HD Hidden Camera with Built in DVR',
 'Mini Clip On Smiley Face Button Spy Hidden Camera with Built in DVR',
 ...]
  • Replaced [A-Za-z].[^\]]+ with (.*?)\]\]> to match everything between <![CDATA and ]]>, non-greedily (?).
Sign up to request clarification or add additional context in comments.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.