I am trying to parse an XML file with Python 2.7. The file is 370+ MB and contains 6,541,000 lines.
The XML file is composed of about 300K blocks like the following:
<Tag:Member>
<fileID id = '123456789'>
<miscTag> 123 </miscTag>
<miscTag2> 456 </miscTag2>
<DateTag> 2008-02-02 </DateTag>
<Tag2:descriptiveTerm>Keyword_1</Tag2:descriptiveTerm>
<miscTag3>6.330016</miscTag3>
<historyTag>
<DateTag>2001-04-16</DateTag>
<reasonTag>Refresh</reasonTag>
</historyTag>
<Tag3:make>Keyword_2</Tag3:make>
<miscTag4>
<miscTag5>
<Tag4:coordinates>6.090,6.000 5.490,4.300 6.090,6.000 </Tag4:coordinates>
</miscTag5>
</miscTag4>
</Tag:Member>
I used the following code:
from xml.dom.minidom import parseString

def XMLParser(filePath):
    """ ===== Load XML File into Memory ===== """
    datafile = open(filePath)
    data = datafile.read()
    datafile.close()
    dom = parseString(data)
    length = len(dom.getElementsByTagName("Tag:Member"))
    counter = 0
    while counter < length:
        # ===== Extract Descriptive Term =====
        contentString = dom.getElementsByTagName("Tag2:descriptiveTerm")[counter].toxml()
        laterpart = contentString.split("Tag2:descriptiveTerm>", 1)[1]
        descriptiveTerm = laterpart.split("</Tag2:descriptiveTerm>", 1)[0]
        if descriptiveTerm == "Keyword_1":
            # ===== Extract Make =====
            contentString = dom.getElementsByTagName("Tag3:make")[counter].toxml()
            laterpart = contentString.split("<Tag3:make>", 1)[1]
            make = laterpart.split("</Tag3:make>", 1)[0]
            if descriptiveTerm == "Keyword_1" and make == "Keyword_2":
                # ===== Extract ID =====
                contentString = dom.getElementsByTagName("Tag:Member")[counter].toxml()
                laterpart = contentString.split("id=\"", 1)[1]
                laterpart = laterpart.split("Tag", 1)[1]
                IDString = laterpart.split("\">", 1)[0]
                # ===== Extract Coordinates =====
                contentString = dom.getElementsByTagName("Tag:Member")[counter].toxml()
                laterpart = contentString.split("coordinates>", 1)[1]
                coordString = laterpart.split(" </Tag4:coordinates>", 1)[0]
        counter += 1
I ran this and found that it uses about 27 GB of memory, and parsing each of the blocks above takes more than 20 seconds. At that rate it will take roughly 2 months to parse this file!
I guess I've written some very inefficient code. Can anyone help me improve it?
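As a rough check of that estimate, here is the arithmetic with the rounded figures above (300K blocks at 20 seconds each; the real per-block time is somewhat higher, so this is a lower bound):

```python
# Rounded figures taken from the description above.
blocks = 300000            # approximate number of <Tag:Member> blocks
seconds_per_block = 20     # observed lower bound per block

total_seconds = blocks * seconds_per_block
total_days = total_seconds / 86400.0   # 86400 seconds in a day

print("%.1f days" % total_days)
```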
Many thanks.
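One direction that may help is streaming the file instead of loading it whole. The following is only a minimal sketch, not a tested solution for the real file. It uses `xml.etree.ElementTree.iterparse` from the standard library, which yields elements as they are read rather than building the full tree first. Several things here are assumptions: the namespace URIs are hypothetical placeholders (the real file declares its own for the `Tag`, `Tag2`, `Tag3` and `Tag4` prefixes), `fileID` is assumed to be an un-namespaced element inside each member, and `find_matches`, `qname` and `SAMPLE` are names made up for illustration.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical namespace URIs; substitute the ones declared in the real file.
NS = {
    "Tag": "http://example.com/tag",
    "Tag2": "http://example.com/tag2",
    "Tag3": "http://example.com/tag3",
    "Tag4": "http://example.com/tag4",
}

def qname(prefix, local):
    """Build the '{uri}local' name ElementTree uses for a prefixed tag."""
    return "{%s}%s" % (NS[prefix], local)

def find_matches(source, term="Keyword_1", make="Keyword_2"):
    """Stream `source` and return (id, coordinates) for matching members."""
    member_tag = qname("Tag", "Member")
    term_tag = qname("Tag2", "descriptiveTerm")
    make_tag = qname("Tag3", "make")
    coords_tag = qname("Tag4", "coordinates")
    results = []
    # "end" events fire once an element and all its children have been read.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != member_tag:
            continue
        found_term = (elem.findtext(".//" + term_tag) or "").strip()
        found_make = (elem.findtext(".//" + make_tag) or "").strip()
        if found_term == term and found_make == make:
            id_elem = elem.find(".//fileID")  # assumed un-namespaced
            file_id = id_elem.get("id") if id_elem is not None else None
            coords = (elem.findtext(".//" + coords_tag) or "").strip()
            results.append((file_id, coords))
        # Drop this member's children and text; the emptied element itself
        # stays attached to the root, which is small by comparison.
        elem.clear()
    return results

# Tiny stand-in for the real file, with the prefixes declared on the root.
SAMPLE = b"""<root xmlns:Tag="http://example.com/tag"
      xmlns:Tag2="http://example.com/tag2"
      xmlns:Tag3="http://example.com/tag3"
      xmlns:Tag4="http://example.com/tag4">
  <Tag:Member>
    <fileID id='123456789'/>
    <Tag2:descriptiveTerm>Keyword_1</Tag2:descriptiveTerm>
    <Tag3:make>Keyword_2</Tag3:make>
    <miscTag4><miscTag5>
      <Tag4:coordinates>6.090,6.000 5.490,4.300 6.090,6.000 </Tag4:coordinates>
    </miscTag5></miscTag4>
  </Tag:Member>
  <Tag:Member>
    <fileID id='987654321'/>
    <Tag2:descriptiveTerm>Other</Tag2:descriptiveTerm>
    <Tag3:make>Keyword_2</Tag3:make>
  </Tag:Member>
</root>"""

matches = find_matches(io.BytesIO(SAMPLE))
print(matches)
```

For the real file, a path could be passed in place of the `io.BytesIO` object. Each member is visited once, so the work should grow linearly with the number of blocks, whereas the code above calls `getElementsByTagName` on the whole document for every block.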