This is the XML DTD (at least I think it is the DTD; I am not that versed in XML, so please correct me if I am wrong):

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE PATDOC SYSTEM "-US-Grant-025xml.dtdST32" [
<!ENTITY USD0484671-20040106-D00000.TIF SYSTEM "USD0484671-20040106-D00000.TIF" NDATA TIF>
<!ENTITY USD0484671-20040106-D00001.TIF SYSTEM "USD0484671-20040106-D00001.TIF" NDATA TIF>
<!ENTITY USD0484671-20040106-D00002.TIF SYSTEM "USD0484671-20040106-D00002.TIF" NDATA TIF>
<!ENTITY USD0484671-20040106-D00003.TIF SYSTEM "USD0484671-20040106-D00003.TIF" NDATA TIF>
<!ENTITY USD0484671-20040106-D00004.TIF SYSTEM "USD0484671-20040106-D00004.TIF" NDATA TIF>
<!ENTITY USD0484671-20040106-D00005.TIF SYSTEM "USD0484671-20040106-D00005.TIF" NDATA TIF>
<!ENTITY USD0484671-20040106-D00006.TIF SYSTEM "USD0484671-20040106-D00006.TIF" NDATA TIF>
<!ENTITY USD0484671-20040106-D00007.TIF SYSTEM "USD0484671-20040106-D00007.TIF" NDATA TIF>
<!ENTITY USD0484671-20040106-D00008.TIF SYSTEM "USD0484671-20040106-D00008.TIF" NDATA TIF>
<!ENTITY USD0484671-20040106-D00009.TIF SYSTEM "USD0484671-20040106-D00009.TIF" NDATA TIF>
<!ENTITY USD0484671-20040106-D00010.TIF SYSTEM "USD0484671-20040106-D00010.TIF" NDATA TIF>
<!ENTITY USD0484671-20040106-D00011.TIF SYSTEM "USD0484671-20040106-D00011.TIF" NDATA TIF>
<!ENTITY USD0484671-20040106-D00012.TIF SYSTEM "USD0484671-20040106-D00012.TIF" NDATA TIF>
]>
<PATDOC DTD="2.5" STATUS="Build 20030724">

I get the following error when I try to run my Python parser:

Traceback (most recent call last):
  File "C:\Users\John\Desktop\FINAL BART ALL INFO-Magic Bullet.py", line 75, in <module>
    doc = etree.XML(item)
  File "lxml.etree.pyx", line 2723, in lxml.etree.XML (src/lxml/lxml.etree.c:52448)
  File "parser.pxi", line 1573, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:79932)
  File "parser.pxi", line 1452, in lxml.etree._parseDoc (src/lxml/lxml.etree.c:78774)
  File "parser.pxi", line 960, in lxml.etree._BaseParser._parseDoc (src/lxml/lxml.etree.c:75389)
  File "parser.pxi", line 564, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:71739)
  File "parser.pxi", line 645, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:72614)
  File "parser.pxi", line 585, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:71955)
XMLSyntaxError: Entity 'num' not defined, line 166, column 84

The script takes patent XML data and parses it out into a delimited file. Also, the imports I used are `import urllib2, os, zipfile` and `from lxml import etree`.

Johnny B
  • did you check the XML file (on line 166, column 84)? – Gonzalo Nov 07 '12 at 16:07
  • This is how it appears in the XML file: `&num;`. The semicolon would be column 83... That being said, I do not know much about XML or Python; this was an application that I inherited, and I am in charge of the database design with the flat files. – Johnny B Nov 07 '12 at 16:17
  • Is this the same problem? [lxml unicode entity parse problems](http://stackoverflow.com/q/2835077/222914) – Janne Karila Nov 07 '12 at 19:47

1 Answer

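`&num;` is the named entity for '#', but it is not one of the five entities XML predefines (`&amp;`, `&lt;`, `&gt;`, `&apos;`, `&quot;`), so lxml reports it as undefined unless a DTD that declares it has been loaded.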

Check the DTD for the file to see whether it declares that entity. If no DTD is available to the parser, that's part of the problem.
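If the DTD cannot be obtained, one possible workaround is to rewrite the offending named entities as numeric character references before parsing, since those need no declaration. The following is only a minimal sketch under assumptions: the `NAMED_ENTITIES` table and the `replace_named_entities` helper are hypothetical names, only `num` is taken from the error message, and the sample element is made up for illustration. It uses the standard library parser to check the result, but the cleaned string could be handed to `etree.XML` from lxml in the same way.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical lookup table: entity name -> the character it stands for.
# Only "num" comes from the error message; add other names as they turn up.
NAMED_ENTITIES = {"num": "#"}

# Matches references such as &num; or &amp; (not numeric ones like &#35;).
_ENTITY_RE = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def replace_named_entities(xml_text, table=NAMED_ENTITIES):
    """Rewrite known named entities as numeric character references."""
    def to_numeric(match):
        name = match.group(1)
        if name in table:
            return "&#%d;" % ord(table[name])
        # Leave predefined entities (&amp;, &lt;, ...) and unknown names alone.
        return match.group(0)
    return _ENTITY_RE.sub(to_numeric, xml_text)


# Made-up fragment standing in for one patent record.
sample = "<PDAT>Claim &num;1 &amp; others</PDAT>"
fixed = replace_named_entities(sample)
root = ET.fromstring(fixed)
```

Note that a plain regular expression also rewrites matches inside CDATA sections, so this assumes the data does not rely on literal `&num;` text there.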

Janne Karila
James Thiele
  • I edited the question to add what I think is the DTD. Again, XML and Python are not my specialty; I just need to parse 2 more years of patent data and then I can build my DB... Thanks again for any help you can give. – Johnny B Nov 11 '12 at 22:52