I'm trying to create a database of all patent information from Google Patents. Much of my work so far has been based on this very good answer from MattH on using Python to parse a non-standard XML file. My Python script is too large to display, so it's linked here.
The source files are here:
a bunch of XML files appended together into one file with multiple headers. The issue is finding the correct XPath expression when parsing this unusual "non-standard" XML file, which has multiple XML and DTD declarations. I have been trying to use "-".join(doc.xpath(...)) to tie everything together once it's parsed out, but the output contains blanks separated by hyphens for the <document-id> and <classification-national> elements shown below:
<references-cited>
  <citation>
    <patcit num="00001">
      <document-id>
        <country>US</country>
        <doc-number>534632</doc-number>
        <kind>A</kind>
        <name>Coleman</name>
        <date>18950200</date>
      </document-id>
    </patcit>
    <category>cited by examiner</category>
    <classification-national>
      <country>US</country>
      <main-classification>249127</main-classification>
    </classification-national>
  </citation>
Note that not all children exist within each <citation>; sometimes they are absent entirely.
How can I write the XPath so that the values are joined with hyphens, without blank entries, for each of the multiple entries under <citation>?
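For reference, here is a minimal sketch of the kind of output I'm after. It is not my real script: it uses the standard library's xml.etree.ElementTree instead of lxml (the findtext call also exists on lxml elements), and the names FIELDS, citation_fields and join_citation are made up for illustration. The second <citation> in the sample is invented data, added only to show the missing-children case. My guess is that the blanks come from selecting container elements such as <document-id>, whose own text is only whitespace, so the sketch reads each leaf element separately and skips anything missing or empty.

```python
import xml.etree.ElementTree as ET

# Sample data: the first citation is the excerpt from above; the second is
# invented for illustration and deliberately omits most children.
SAMPLE = """<references-cited>
<citation>
<patcit num="00001"><document-id>
<country>US</country>
<doc-number>534632</doc-number>
<kind>A</kind>
<name>Coleman</name>
<date>18950200</date>
</document-id></patcit>
<category>cited by examiner</category>
<classification-national><country>US</country>
<main-classification>249127</main-classification></classification-national>
</citation>
<citation>
<patcit num="00002"><document-id>
<country>US</country>
<doc-number>1234567</doc-number>
</document-id></patcit>
</citation>
</references-cited>"""

# Leaf elements to read, as paths relative to each <citation> (hypothetical list).
FIELDS = [
    "patcit/document-id/country",
    "patcit/document-id/doc-number",
    "patcit/document-id/kind",
    "patcit/document-id/name",
    "patcit/document-id/date",
    "category",
    "classification-national/country",
    "classification-national/main-classification",
]


def citation_fields(citation):
    """Return the non-empty text values found under one <citation>."""
    values = []
    for path in FIELDS:
        text = citation.findtext(path)  # None when the element is missing
        if text and text.strip():
            values.append(text.strip())
    return values


def join_citation(citation, sep="-"):
    """Join the values that are present, so no blank slots appear."""
    return sep.join(citation_fields(citation))


root = ET.fromstring(SAMPLE)
rows = [join_citation(c) for c in root.iter("citation")]
for row in rows:
    print(row)
```

With the sample above this prints US-534632-A-Coleman-18950200-cited by examiner-US-249127 for the first citation and US-1234567 for the second. Filtering before joining is what avoids the doubled hyphens; if the position of each field matters, a placeholder could be appended instead of skipping the missing value.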