I’m trying to build a database of all patent information from Google Patents. Most of my work so far builds on this very good answer from MattH on using Python to parse a non-standard XML file. My Python script is too large to display, so it's linked here.

The source files are here: a bunch of XML files appended together into one file with multiple headers. The issue is finding the correct XPath expression when parsing this unusual "non-standard" XML file, which has multiple XML and DTD declarations. I have been trying to use "-".join(doc.xpath to tie everything together once it's parsed, but the output contains only blanks separated by hyphens for the <document-id> and <classification-national> elements shown below.

<references-cited> <citation> 
<patcit num="00001"> <document-id>
<country>US</country> 
<doc-number>534632</doc-number> 
<kind>A</kind>
<name>Coleman</name> 
<date>18950200</date> 
</document-id> </patcit>
<category>cited by examiner</category>
<classification-national><country>US</country>
<main-classification>249127</main-classification></classification-national>
</citation>

Note that not every child element exists within each <citation>; sometimes some of them are missing entirely.

How can I write the XPath expression so that hyphens are placed between the data entries, for each of the multiple entries under <citation>?

  • Please trim down your question. If I understand you, the problem is building a correct XPath expression with Python. Commented Feb 27, 2012 at 20:56
  • I've edited the question again to make it easier to read. Commented Feb 27, 2012 at 21:13

1 Answer

From this XML (references.xml),

<references-cited> 
  <citation> 
    <patcit num="00001"> 
      <document-id>
        <country>US</country> 
        <doc-number>534632</doc-number> 
        <kind>A</kind>
        <name>Coleman</name> 
        <date>18950200</date> 
      </document-id> 
    </patcit>
    <category>cited by examiner</category>
    <classification-national>
      <country>US</country>
      <main-classification>249127</main-classification>
    </classification-national>
  </citation>

  <citation>
    <patcit num="00002">
      <document-id>
        <country>US</country>
        <doc-number>D28957</doc-number>
        <kind>S</kind>
        <name>Simon</name>
        <date>18980600</date>
      </document-id>
    </patcit>
    <category>cited by other</category>
  </citation>
</references-cited>

you can get the text content of every descendant of <citation> that has any content as follows:

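from lxml import etree

doc = etree.parse("references.xml")
cits = doc.xpath('/references-cited/citation')

for c in cits:
    # All descendant elements of this <citation>, in document order.
    descs = c.xpath('.//*')
    for d in descs:
        # Skip container elements whose text is only whitespace.
        if d.text and d.text.strip():
            print("%s: %s" % (d.tag, d.text))
    print()  # blank line between citations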

Output:

country: US
doc-number: 534632
kind: A
name: Coleman
date: 18950200
category: cited by examiner
country: US
main-classification: 249127

country: US
doc-number: D28957
kind: S
name: Simon
date: 18980600
category: cited by other

This variation:

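import sys
from lxml import etree

doc = etree.parse("references.xml")
cits = doc.xpath('/references-cited/citation')

for c in cits:
    descs = c.xpath('.//*')
    for d in descs:
        # Skip container elements whose text is only whitespace.
        if d.text and d.text.strip():
            # Write each value with a leading hyphen and no newline.
            sys.stdout.write("-%s" % d.text)
    print()  # end the line for this citation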

results in this output:

-US-534632-A-Coleman-18950200-cited by examiner-US-249127
-US-D28957-S-Simon-18980600-cited by other
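
If the leading hyphen is unwanted, the values can be collected into a list first and joined afterwards, which is closer to the "-".join attempt in the question. What follows is only a sketch: the helper name citation_lines is made up for illustration, the sample XML is inlined so the snippet runs on its own, and it falls back to the standard library's ElementTree when lxml is not installed. It uses iter('*') rather than xpath('.//*') because both libraries support it.

```python
try:
    from lxml import etree
except ImportError:  # assumption: stdlib ElementTree is an acceptable stand-in
    import xml.etree.ElementTree as etree

# Inline copy of references.xml so the sketch is self-contained.
XML = b"""<references-cited>
  <citation>
    <patcit num="00001">
      <document-id>
        <country>US</country>
        <doc-number>534632</doc-number>
        <kind>A</kind>
        <name>Coleman</name>
        <date>18950200</date>
      </document-id>
    </patcit>
    <category>cited by examiner</category>
    <classification-national>
      <country>US</country>
      <main-classification>249127</main-classification>
    </classification-national>
  </citation>
  <citation>
    <patcit num="00002">
      <document-id>
        <country>US</country>
        <doc-number>D28957</doc-number>
        <kind>S</kind>
        <name>Simon</name>
        <date>18980600</date>
      </document-id>
    </patcit>
    <category>cited by other</category>
  </citation>
</references-cited>"""


def citation_lines(root):
    """Return one hyphen-joined string per <citation> (hypothetical helper)."""
    lines = []
    for cit in root.iter('citation'):
        # Container elements such as <document-id> hold only whitespace,
        # so they are filtered out here instead of producing empty fields.
        texts = [d.text.strip() for d in cit.iter('*')
                 if d.text and d.text.strip()]
        lines.append("-".join(texts))
    return lines


root = etree.fromstring(XML)
for line in citation_lines(root):
    print(line)
# US-534632-A-Coleman-18950200-cited by examiner-US-249127
# US-D28957-S-Simon-18980600-cited by other
```

Filtering before joining is what avoids the blanks separated by hyphens: joining the text of every descendant, including the whitespace-only containers, is what yields the empty fields.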

7 Comments

I sense we are close, but this method didn't yield any output when I tried it. I have edited my question above to explain my issue more fully.
Did you try my example exactly as it is, or did you try to use my Python code on another XML document?
I used it on another XML file. Just the last 4 lines you gave me, starting at descs = doc.xpath
I believe the reason there is an issue is that the XML I'm using comes from a .zip and has multiple XML declarations within it. It's just this way with this file type, which is why the Python code by MattH cited at the top was able to parse it. Also, when the XML was edited to be well-formed, your Python ran completely as is, but no output was printed in the shell or in a separate file.
Your question was about an XPath problem. You can't evaluate XPath expressions on "non-standard" XML. My example works.
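
On the multiple-declaration point raised in these comments: one possible workaround, sketched here under assumptions, is to cut the combined file at each XML declaration and parse every piece separately, so that XPath is only ever evaluated on well-formed documents. The helper name split_documents, the sample data and the DTD name example.dtd are made up for illustration. The split assumes that the text "<?xml" appears only as a declaration at the start of a document, and that the whole file fits in memory.

```python
import re

try:
    from lxml import etree
except ImportError:  # assumption: stdlib ElementTree is an acceptable stand-in
    import xml.etree.ElementTree as etree

# Made-up sample: two complete XML documents appended into one byte string.
SAMPLE = (
    b'<?xml version="1.0" encoding="UTF-8"?>\n'
    b'<!DOCTYPE references-cited SYSTEM "example.dtd">\n'
    b'<references-cited><citation>'
    b'<category>cited by examiner</category>'
    b'</citation></references-cited>\n'
    b'<?xml version="1.0" encoding="UTF-8"?>\n'
    b'<!DOCTYPE references-cited SYSTEM "example.dtd">\n'
    b'<references-cited><citation>'
    b'<category>cited by other</category>'
    b'</citation></references-cited>\n'
)


def split_documents(data):
    """Split concatenated XML (bytes) into one chunk per XML declaration."""
    starts = [m.start() for m in re.finditer(rb'<\?xml\s', data)]
    ends = starts[1:] + [len(data)]
    return [data[s:e] for s, e in zip(starts, ends)]


# Each chunk is now a single well-formed document and can be parsed normally.
roots = [etree.fromstring(chunk) for chunk in split_documents(SAMPLE)]
categories = [r.findtext('citation/category') for r in roots]
print(categories)
# ['cited by examiner', 'cited by other']
```

For very large dumps, a streaming split that reads the file line by line would be gentler on memory than loading everything at once.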