One of the simulation programs I use spits out a result file in XML format.
The data are collected in <Step>
elements. Each step contains a string of numbers, separated by spaces or newlines. For large files I can easily have thousands of <Step>
elements to parse.
Here is an example of the XML file:
<Step type="dynamic">
0.002
4.66293670342565747E-15 -1.42108547152020037E-13 -1.1368683772161603E-13 0 0 0
</Step>
<Step type="dynamic">
0.003
4.66293670342565747E-15 -1.42108547152020037E-13 -1.1368683772161603E-13 0 0 0
</Step>
<Step type="dynamic">
0.004
4.66293670342565747E-15 -1.42108547152020037E-13 -1.1368683772161603E-13 0 0 0
</Step>
Here are the three steps I originally used to parse each <Step>:
if data_point.getAttribute('type') == 'dynamic':
    # Step 1: very fast
    data_text = data_point.childNodes[0].nodeValue.replace('\n', ' ')
    # Step 2: slow
    data_values = [float(f) for f in re.split('\n| ', data_text) if f]
    # OR
    data_values = list(np.fromstring(data_text, dtype=float, sep=' '))
    # Step 3: very fast
    data_values_all.append(data_values)
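For context, here is a self-contained sketch of that loop over the example data above. It assumes the steps sit under a single root element and are read with xml.dom.minidom (the getAttribute/childNodes calls suggest the DOM API, but the actual loading code is not shown in the question):

```python
import re
from xml.dom import minidom

# Two of the example <Step> elements, wrapped in an assumed <Steps> root.
xml_text = """<Steps>
<Step type="dynamic">
0.002
4.66293670342565747E-15 -1.42108547152020037E-13 -1.1368683772161603E-13 0 0 0
</Step>
<Step type="dynamic">
0.003
4.66293670342565747E-15 -1.42108547152020037E-13 -1.1368683772161603E-13 0 0 0
</Step>
</Steps>"""

doc = minidom.parseString(xml_text)
data_values_all = []
for data_point in doc.getElementsByTagName('Step'):
    if data_point.getAttribute('type') == 'dynamic':
        # Step 1: flatten newlines to spaces
        data_text = data_point.childNodes[0].nodeValue.replace('\n', ' ')
        # Step 2: split on whitespace and convert each token to float (slow)
        data_values = [float(f) for f in re.split('\n| ', data_text) if f]
        # Step 3: collect the parsed step
        data_values_all.append(data_values)

print(len(data_values_all), len(data_values_all[0]))
```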
The second step is the one that slows everything down. I tried different approaches, like using numpy.fromstring, itertools.ifilter to speed up the removal of the first and last elements after the split, and a few others I don't remember now.
None of them was really effective. To give you a measure: in a file with 600 points in each step and 100k steps, this is how long each step takes in total:
- Step 1: 0.2 s
- Step 2: 30 s
- Step 3: 0.1 s
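To isolate where Step 2 spends its time, here is a small micro-benchmark sketch (scaled down from the 600-points-per-step figure above; the token is one of the example values, and the iteration count is an arbitrary choice):

```python
import re
import timeit

# 600 tokens per step, as in the measurements above.
data_text = ' '.join(['4.66293670342565747E-15'] * 600)

# Splitting alone vs. splitting plus float conversion.
split_only = timeit.timeit(
    lambda: [f for f in re.split('\n| ', data_text) if f], number=100)
split_and_float = timeit.timeit(
    lambda: [float(f) for f in re.split('\n| ', data_text) if f], number=100)

print(split_only, split_and_float)
```

On a typical run the float conversion adds noticeably to the split time, which matches the breakdown above.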
Does anybody have any suggestion to make this string parsing faster?
Thanks
UPDATE:
After @kjhughes's comment, it's clear that the culprit is the float parsing.
Using int parsing instead makes everything super fast, but of course that doesn't work for my data.
I also just tried the fastnumbers module (https://pypi.org/project/fastnumbers/), but performance is basically identical.
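The int-vs-float observation above can be sketched with a quick timing comparison, using stdlib timeit only (the token values and iteration counts are arbitrary illustrations; int() only accepts integer tokens, which is why it is not usable on this data):

```python
import timeit

# Same token count, integer tokens vs. long scientific-notation floats.
int_tokens = ['123456789'] * 600
float_tokens = ['4.66293670342565747E-15'] * 600

t_int = timeit.timeit(lambda: [int(t) for t in int_tokens], number=100)
t_float = timeit.timeit(lambda: [float(t) for t in float_tokens], number=100)

print(t_int, t_float)
```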