
I'm using lxml as follows to parse an exported XML file from another system:

from lxml import etree

xmldoc = open(filename)
etree.parse(xmldoc)

But I'm getting:

lxml.etree.XMLSyntaxError: Entity 'eacute' not defined, line 4495, column 46

Obviously it's having problems with Unicode entity names, but how would I get around this? Via open() or parse()?

Edit: I had forgotten to include my DTD in the same folder - it's there now and has the following declaration:

<!ENTITY eacute "&#233;">

and is referenced (as it always was) in the XML document like so:

<?xml version="1.0" encoding="ISO-8859-1" ?>
<!DOCTYPE DScribeDatabase SYSTEM "foo.dtd">

Yet I still get the same problem ... does the DTD need to be declared in Python too?

Jon Hadley

1 Answer


eacute is not a predefined entity in XML. To include an &eacute; entity reference in an XML file, it must have a <!DOCTYPE> declaration pointing to a DTD (such as an XHTML 1.0 DTD) that defines the entity.

If the XML uses &eacute; but doesn't have a <!DOCTYPE>, it is not well-formed and the system that exported it needs to be fixed.

(There isn't a good reason to use an entity reference to represent é in an XML file. The character reference &#233; is understood everywhere without entity definitions, if the file can't simply include a raw UTF-8 é for some reason.)
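
To illustrate the difference, here is a minimal sketch using lxml on two in-memory strings (the `<name>` element and the sample text are made up for the example). The numeric character reference parses without any DTD, while the named entity reference is rejected:

```python
from lxml import etree

# A numeric character reference needs no entity definition.
with_char_ref = etree.fromstring(b"<name>caf&#233;</name>")
text = with_char_ref.text

# A named entity reference with no DTD defining it is a syntax error.
try:
    etree.fromstring(b"<name>caf&eacute;</name>")
    entity_error = None
except etree.XMLSyntaxError as exc:
    entity_error = str(exc)

print(repr(text))
print(entity_error)
```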

bobince
  • I've added the DTD file, which was missing (the doctype declaration was already there), but I still get the same error. – Jon Hadley May 17 '10 at 16:16
  • Make sure you're passing an `etree.XMLParser(load_dtd=True)` to `etree.parse()` so that it actually uses the DTD. – bobince May 17 '10 at 16:37
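
For anyone landing here later, a rough sketch of what the `load_dtd=True` suggestion could look like end to end. The helper name `parse_with_dtd`, the temporary directory, the file name `export.xml`, and the sample `<name>` element are all assumptions made up for the example; only `foo.dtd`, the entity declaration, and the doctype line come from the question. It assumes the DTD sits next to the XML file so the relative `SYSTEM "foo.dtd"` reference resolves.

```python
import os
import tempfile

from lxml import etree


def parse_with_dtd(path):
    """Parse an XML file, letting lxml load the DTD named in its DOCTYPE."""
    # Without load_dtd=True, lxml does not read the external DTD,
    # so entities declared there stay undefined.
    parser = etree.XMLParser(load_dtd=True)
    return etree.parse(path, parser)


# Build a tiny stand-in for the exported file and its DTD (hypothetical data).
tmpdir = tempfile.mkdtemp()

with open(os.path.join(tmpdir, "foo.dtd"), "w") as f:
    f.write('<!ENTITY eacute "&#233;">\n')

xml_path = os.path.join(tmpdir, "export.xml")
with open(xml_path, "w") as f:
    f.write(
        '<?xml version="1.0" encoding="ISO-8859-1" ?>\n'
        '<!DOCTYPE DScribeDatabase SYSTEM "foo.dtd">\n'
        "<DScribeDatabase><name>caf&eacute;</name></DScribeDatabase>\n"
    )

tree = parse_with_dtd(xml_path)
name = tree.getroot().findtext("name")
print(repr(name))
```

Passing the path (rather than an anonymous stream) gives lxml a base location from which to find `foo.dtd`.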