Parsing XPath within non standard XML using lxml Python

Question

I’m trying to create a database of all patent information from Google Patents. Much of my work so far builds on this very good answer from MattH about parsing a non-standard XML file in Python. My Python script is too large to display, so it's linked here.

The source files are here: a bunch of XML files appended together into one file with multiple headers. The issue is finding the correct XPath expression when parsing this unusual "non-standard" XML file, which has multiple XML and DTD declarations. I have been trying to use "-".join(doc.xpath(...)) to tie everything together once it's parsed, but the output contains blanks separated by hyphens for the <document-id> and <classification-national> elements shown below:

<references-cited> <citation> 
<patcit num="00001"> <document-id>
<country>US</country> 
<doc-number>534632</doc-number> 
<kind>A</kind>
<name>Coleman</name> 
<date>18950200</date> 
</document-id> </patcit>
<category>cited by examiner</category>
<classification-national><country>US</country>
<main-classification>249127</main-classification></classification-national>
</citation>

Note that not every child element exists within each <citation>; sometimes some are not present at all.

How can I evaluate this XPath so that a hyphen is placed between each data entry when there are multiple entries under <citation>?

Please trim down your question. If I understand you, the problem is building a correct XPath expression with Python.
– user647772, Commented Feb 27, 2012 at 20:56

mzjn · Accepted Answer · 2012-02-27 21:45:18Z

From this XML (references.xml),

<references-cited> 
  <citation> 
    <patcit num="00001"> 
      <document-id>
        <country>US</country> 
        <doc-number>534632</doc-number> 
        <kind>A</kind>
        <name>Coleman</name> 
        <date>18950200</date> 
      </document-id> 
    </patcit>
    <category>cited by examiner</category>
    <classification-national>
      <country>US</country>
      <main-classification>249127</main-classification>
    </classification-national>
  </citation>

  <citation>
    <patcit num="00002">
      <document-id>
        <country>US</country>
        <doc-number>D28957</doc-number>
        <kind>S</kind>
        <name>Simon</name>
        <date>18980600</date>
      </document-id>
    </patcit>
    <category>cited by other</category>
  </citation>
</references-cited>

you can get the text content of every descendant of <citation> that has any content as follows:

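from lxml import etree

doc = etree.parse("references.xml")
cits = doc.xpath('/references-cited/citation')

for c in cits:
    # All descendant elements of this <citation>
    descs = c.xpath('.//*')
    for d in descs:
        # Skip container elements whose text is only whitespace
        if d.text and d.text.strip():
            print("%s: %s" % (d.tag, d.text))
    print()  # blank line between citations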

Output:

country: US
doc-number: 534632
kind: A
name: Coleman
date: 18950200
category: cited by examiner
country: US
main-classification: 249127

country: US
doc-number: D28957
kind: S
name: Simon
date: 18980600
category: cited by other
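The `d.text and d.text.strip()` test is what removes the blank entries described in the question. A minimal sketch of why, using a trimmed copy of the first citation: container elements such as <patcit> and <document-id> hold only the whitespace between their children, so their text is blank. The stdlib fallback import is an assumption added so the snippet also runs where lxml is not installed; it uses only calls that both libraries provide.

```python
try:
    from lxml import etree
except ImportError:
    # Assumption: the stdlib ElementTree is an acceptable stand-in here,
    # because only fromstring(), findall() and .text are used.
    import xml.etree.ElementTree as etree

xml = b"""<citation>
  <patcit num="00001">
    <document-id>
      <country>US</country>
      <doc-number>534632</doc-number>
    </document-id>
  </patcit>
  <category>cited by examiner</category>
</citation>"""

citation = etree.fromstring(xml)
descendants = citation.findall(".//*")

# Elements whose text is missing or whitespace-only (the containers)
blank = [d.tag for d in descendants if not (d.text and d.text.strip())]
# Elements that carry actual data (the leaves)
filled = [d.tag for d in descendants if d.text and d.text.strip()]

print(blank)   # ['patcit', 'document-id']
print(filled)  # ['country', 'doc-number', 'category']
```

Joining the text of every descendant without this filter would include the whitespace-only values, which shows up as empty fields between hyphens.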

This variation:

import sys
from lxml import etree

doc = etree.parse("references.xml")
cits = doc.xpath('/references-cited/citation')

for c in cits:
    descs = c.xpath('.//*')
    for d in descs:
        # Write each non-blank value prefixed with a hyphen, all on one line
        if d.text and d.text.strip():
            sys.stdout.write("-%s" % d.text)
    print()  # end the line for this citation

results in this output:

-US-534632-A-Coleman-18950200-cited by examiner-US-249127
-US-D28957-S-Simon-18980600-cited by other
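If the leading hyphen is unwanted, one possible alternative is to collect the non-blank values in a list and join them, which is closer to the "-".join approach from the question. This is a sketch, not part of the original answer: `citation_line` is a hypothetical helper name, and the stdlib fallback import is an assumption so the snippet runs without lxml.

```python
try:
    from lxml import etree
except ImportError:
    # Assumption: stdlib fallback; only fromstring(), findall() and .text are used.
    import xml.etree.ElementTree as etree

xml = b"""<references-cited>
  <citation>
    <patcit num="00001">
      <document-id>
        <country>US</country><doc-number>534632</doc-number><kind>A</kind>
        <name>Coleman</name><date>18950200</date>
      </document-id>
    </patcit>
    <category>cited by examiner</category>
    <classification-national>
      <country>US</country><main-classification>249127</main-classification>
    </classification-national>
  </citation>
  <citation>
    <patcit num="00002">
      <document-id>
        <country>US</country><doc-number>D28957</doc-number><kind>S</kind>
        <name>Simon</name><date>18980600</date>
      </document-id>
    </patcit>
    <category>cited by other</category>
  </citation>
</references-cited>"""


def citation_line(citation, sep="-"):
    """Join the non-blank text of every descendant of one <citation> (hypothetical helper)."""
    values = [d.text.strip() for d in citation.findall(".//*")
              if d.text and d.text.strip()]
    return sep.join(values)


root = etree.fromstring(xml)
lines = [citation_line(c) for c in root.findall("citation")]
for line in lines:
    print(line)
# US-534632-A-Coleman-18950200-cited by examiner-US-249127
# US-D28957-S-Simon-18980600-cited by other
```

Because missing children simply contribute nothing to the list, a citation without <classification-national> produces a shorter line rather than empty fields.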

edited Feb 27, 2012 at 21:45

answered Feb 27, 2012 at 19:37

mzjn


7 Comments

Hola Sir Over a year ago

I sense we are close, but this method didn't yield any output when I tried it. I have edited my question above to explain my issue further.

mzjn Over a year ago

Did you try my example exactly as it is, or did you try to use my Python code on another XML document?

Hola Sir Over a year ago

I used it on another XML file, with just the last 4 lines you gave me, starting at descs = doc.xpath

Hola Sir Over a year ago

I believe the issue is that the XML I'm using comes from a .zip and has multiple XML declarations within it. That is just how this file type is, which is why the Python code by MattH cited at the top was able to parse it. Also, when the XML was edited properly, your Python code ran completely as is, but no output was printed in the shell or in a separate file.

mzjn Over a year ago

Your question was about an XPath problem. You can't evaluate XPath expressions on "non-standard" XML. My example works.
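Since XPath needs a well-formed document, one way to reconcile the two positions in this thread is to split the concatenated file into separate documents first and then parse each one on its own. The following is a rough sketch under stated assumptions, not MattH's code: `split_documents` is a hypothetical helper, "example.dtd" is a made-up DTD name, every document is assumed to start with its own `<?xml ...?>` declaration, and that string is assumed never to appear inside element text. The stdlib fallback import is likewise an assumption so the snippet runs without lxml.

```python
import re

try:
    from lxml import etree
except ImportError:
    # Assumption: stdlib fallback; only fromstring() and findtext() are used.
    import xml.etree.ElementTree as etree

# Two documents appended into one blob, each with its own XML and DTD declaration
data = b"""<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE references-cited SYSTEM "example.dtd">
<references-cited><citation><category>cited by examiner</category></citation></references-cited>
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE references-cited SYSTEM "example.dtd">
<references-cited><citation><category>cited by other</category></citation></references-cited>
"""


def split_documents(blob):
    """Return one bytes chunk per XML declaration found in blob (hypothetical helper)."""
    starts = [m.start() for m in re.finditer(rb"<\?xml\s", blob)]
    if not starts:
        return [blob] if blob.strip() else []
    bounds = starts + [len(blob)]
    return [blob[a:b] for a, b in zip(bounds, bounds[1:])]


chunks = split_documents(data)

# Each chunk is now a single well-formed document that can be queried normally
categories = [etree.fromstring(chunk).findtext("citation/category")
              for chunk in chunks]
print(categories)  # ['cited by examiner', 'cited by other']
```

This reads the whole file into memory, so for very large dumps a line-by-line variant that starts a new buffer at each declaration may be more practical.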
