
I am trying to parse an XML file with Python 2.7. The XML file is 370+ MB and contains 6,541,000 lines.

The XML file is composed of roughly 300K of the following blocks:

<Tag:Member>
    <fileID id = '123456789'>
    <miscTag> 123 </miscTag>
    <miscTag2> 456 </miscTag2>
    <DateTag> 2008-02-02 </DateTag>
    <Tag2:descriptiveTerm>Keyword_1</Tag2:descriptiveTerm>
    <miscTag3>6.330016</miscTag3>
    <historyTag>
        <DateTag>2001-04-16</DateTag>
        <reasonTag>Refresh</reasonTag>
    </historyTag>
    <Tag3:make>Keyword_2</Tag3:make>
    <miscTag4>
            <miscTag5>
                <Tag4:coordinates>6.090,6.000 5.490,4.300 6.090,6.000 </Tag4:coordinates>
            </miscTag5>
        </miscTag4>
</Tag:Member>

I used the following code:

from xml.dom.minidom import parseString

def XMLParser(filePath):    
    """ ===== Load XML File into Memory ===== """
    datafile = open(filePath)
    data = datafile.read()
    datafile.close()
    dom = parseString(data)    

    length = len(dom.getElementsByTagName("Tag:Member"))


    counter = 0
    while counter < length:
        """ ===== Extract Descriptive Term ===== """
        contentString = dom.getElementsByTagName("Tag2:descriptiveTerm")[counter].toxml()

        laterpart = contentString.split("Tag2:descriptiveTerm>", 1)[1]

        descriptiveTerm = laterpart.split("</Tag2:descriptiveTerm>", 1)[0]    


        if descriptiveTerm == "Keyword_1":
            """ ===== Extract Make ===== """
            contentString = dom.getElementsByTagName("Tag3:make")[counter].toxml()

            laterpart = contentString.split("<Tag3:make>", 1)[1]

            make = laterpart.split("</Tag3:make>", 1)[0]



            if descriptiveTerm == "Keyword_1" and make == "Keyword_2":
                """ ===== Extract ID ===== """        
                contentString = dom.getElementsByTagName("Tag:Member")[counter].toxml()

                laterpart = contentString.split("id=\"", 1)[1]

                laterpart = laterpart.split("Tag", 1)[1]

                IDString = laterpart.split("\">", 1)[0]



                """ ===== Extract Coordinates ===== """
                contentString = dom.getElementsByTagName("Tag:Member")[counter].toxml()

                laterpart = contentString.split("coordinates>", 1)[1]

                coordString = laterpart.split(" </Tag4:coordinates>", 1)[0]            


        counter += 1

So, I've run this and found that it takes about 27 GB of memory, and parsing each of the above blocks takes more than 20 seconds. So it will take 2 months to parse this file!

I guess I've written some inefficient code. Can anyone help me improve it?

Many thanks.

ChangeMyName
  • Converting back from DOM to XML is indeed needless and inefficient, and using string-splitting to navigate XML is downright horrid. Frankly, folks generally don't use minidom anymore at all unless they're trying to run code written for much older versions of Python, so the gist of the advice would be "don't do that". :) – Charles Duffy Feb 10 '15 at 17:16
  • 1
    I'd strongly (strongly!) suggest using a modern library based on a libxml2; lxml fits the bill, though cElementTree is workable also. And don't ever, ever parse your XML with string-splitting on syntax elements. – Charles Duffy Feb 10 '15 at 17:17
  • Actually, for a file of that size, a streaming parser is probably the better choice for efficiency purposes. – Charles Duffy Feb 10 '15 at 17:19
  • BTW, `Tag1` and `Tag2` are not tags but namespaces; they need `xmlns` declarations somewhere in the parent document to be valid syntax, which you aren't giving in your example. – Charles Duffy Feb 10 '15 at 17:24
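
The namespace point above can be illustrated with a minimal sketch using the stdlib `xml.etree.ElementTree`. The `xmlns` URI and tag contents here are hypothetical stand-ins, since the question doesn't show the real declarations:

```python
import xml.etree.ElementTree as ET

# A hypothetical xmlns declaration that makes the prefixed tags
# (Tag:Member etc.) well-formed XML; the real URI is not shown in the question.
doc = """<root xmlns:Tag="http://example.com/ns">
    <Tag:Member>
        <Tag:descriptiveTerm>Keyword_1</Tag:descriptiveTerm>
    </Tag:Member>
</root>"""

root = ET.fromstring(doc)
ns = {'Tag': 'http://example.com/ns'}   # prefix -> URI map used by the queries
members = root.findall('Tag:Member', ns)
print(members[0].findtext('Tag:descriptiveTerm', namespaces=ns))   # Keyword_1
```

Without the `xmlns:Tag="..."` attribute on a parent element, `fromstring` raises a parse error on the `Tag:` prefix.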

1 Answer


For a file of this size, the correct approach is a streaming parser (SAX-style, not DOM-style, so minidom is entirely inappropriate). See this answer for notes on using `lxml.etree.iterparse` (a modern streaming parser backed by libxml2 -- a fast and efficient XML-parsing library written in C) in a memory-efficient way, or the article on which that answer is based.

In general -- as you see elements associated with a member, build that member up in memory; when you see the event associated with the end of its tag, emit or process the accumulated content and start a fresh one.
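
A minimal sketch of that pattern, using the stdlib `ElementTree.iterparse` (`lxml.etree.iterparse` has the same interface). It assumes, for brevity, a single hypothetical namespace URI for all of the question's prefixes, and uses tag and keyword names taken from the question:

```python
import io
import xml.etree.ElementTree as ET

NS = 'http://example.com/ns'   # hypothetical namespace URI; the real one is not shown

def find_matching_ids(fileobj):
    """Stream <Tag:Member> blocks one at a time instead of loading the whole DOM."""
    ids = []
    for event, elem in ET.iterparse(fileobj, events=('end',)):
        if elem.tag == '{%s}Member' % NS:          # end of one member block
            term = elem.findtext('{%s}descriptiveTerm' % NS)
            make = elem.findtext('{%s}make' % NS)
            if term == 'Keyword_1' and make == 'Keyword_2':
                ids.append(elem.find('{%s}fileID' % NS).get('id'))
            elem.clear()                           # drop the subtree; keeps memory flat
    return ids

sample = b"""<root xmlns:Tag="http://example.com/ns">
  <Tag:Member>
    <Tag:fileID id="123456789"/>
    <Tag:descriptiveTerm>Keyword_1</Tag:descriptiveTerm>
    <Tag:make>Keyword_2</Tag:make>
  </Tag:Member>
  <Tag:Member>
    <Tag:fileID id="987"/>
    <Tag:descriptiveTerm>Keyword_1</Tag:descriptiveTerm>
    <Tag:make>Other</Tag:make>
  </Tag:Member>
</root>"""

print(find_matching_ids(io.BytesIO(sample)))   # ['123456789']
```

The `elem.clear()` call is what keeps memory usage proportional to one block rather than the whole 370 MB file.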

Charles Duffy