Fastest way to parse large XML with numeric values in Python - Slow float casting

Question

One of the simulation packages I use writes its results to an XML file. The data are collected in <Step> elements. Each step contains a string of numbers separated by spaces or newlines. A large file can easily contain thousands of <Step> elements to parse.

Here is an example of the XML file:

<Step type="dynamic">
0.002
4.66293670342565747E-15 -1.42108547152020037E-13 -1.1368683772161603E-13 0 0 0
</Step>
<Step type="dynamic">
0.003
4.66293670342565747E-15 -1.42108547152020037E-13 -1.1368683772161603E-13 0 0 0
</Step>
<Step type="dynamic">
0.004
4.66293670342565747E-15 -1.42108547152020037E-13 -1.1368683772161603E-13 0 0 0
</Step>
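
For context, the text of each <Step> can be read one element at a time with the standard library's xml.etree.ElementTree.iterparse. This is only a sketch: the <Results> root element and the iter_dynamic_steps name are assumptions, since the excerpt above does not show the file's real root, and it is not the DOM-based code I actually use.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical wrapper: the real root element is not shown in the excerpt.
SAMPLE_XML = b"""<Results>
<Step type="dynamic">
0.002
4.66293670342565747E-15 -1.42108547152020037E-13 -1.1368683772161603E-13 0 0 0
</Step>
<Step type="dynamic">
0.003
4.66293670342565747E-15 -1.42108547152020037E-13 -1.1368683772161603E-13 0 0 0
</Step>
</Results>"""


def iter_dynamic_steps(source):
    """Yield the raw text of each <Step type="dynamic"> element in turn."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "Step" and elem.get("type") == "dynamic":
            yield elem.text or ""
            elem.clear()  # drop the element's text once it has been consumed


steps = list(iter_dynamic_steps(io.BytesIO(SAMPLE_XML)))
```

Each yielded string still has to be split and converted to floats, so this only changes how the text is obtained, not the numeric parsing itself.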

Here are the 3 steps I originally used to parse each <Step>:

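import re
import numpy as np

if data_point.getAttribute('type') == 'dynamic':
    # Step 1: very fast (flatten the text node onto a single line)
    data_text = data_point.childNodes[0].nodeValue.replace('\n', ' ')

    # Step 2: slow (split into tokens and convert each one to float)
    data_values = [float(f) for f in re.split('\n| ', data_text) if f]
    # OR, equally slow:
    data_values = list(np.fromstring(data_text, dtype=float, sep=' '))

    # Step 3: very fast (collect the values for this step)
    data_values_all.append(data_values)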
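
Step 2 can also be written without the regular expression and without the trailing `if f` filter, because `str.split()` with no arguments splits on any run of whitespace and never produces empty strings. The sketch below uses a hypothetical helper name, `parse_step`, and is untimed: it still calls `float()` once per token, so it may not help if the float conversion itself is the cost.

```python
def parse_step(text):
    """Convert whitespace-separated numbers (spaces or newlines) to floats.

    str.split() with no separator treats any whitespace run as one delimiter
    and ignores leading/trailing whitespace, so no empty tokens need filtering.
    """
    return list(map(float, text.split()))


# Sample text shaped like one <Step> body from the example file.
sample = "\n0.002\n4.66293670342565747E-15 -1.42108547152020037E-13 0 0 0\n"
values = parse_step(sample)
```

With this form, Step 1 (replacing newlines with spaces) is no longer needed, since newlines are already treated as separators.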

The second step is the one that slows everything down. I tried different approaches, such as numpy.fromstring, itertools.ifilter (to speed up removing the first and last elements after the split), and a few others I don't remember now.

None of them was really effective. To give you a measure, for a file with 600 points per step and 100k steps, this is how long each step takes in total:

Step1: 0.2s
Step2: 30s
Step3: 0.1s

Does anybody have any suggestions for making this string parsing faster?

Thanks

UPDATE:

After @kjhughes's comment, it's clear that the culprit is the float parsing. Parsing the values as int instead makes everything super fast, but of course that doesn't work for my data.

I also just tried the fastnumbers module (https://pypi.org/project/fastnumbers/), but the performance is basically identical.
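
To check where the time in Step 2 goes, the splitting and the float conversion can be timed separately with the standard library's `timeit`. This is a minimal sketch: `SAMPLE` and `time_stages` are hypothetical names, the synthetic text just repeats one value from the example file 600 times, and no timings are claimed here.

```python
import timeit

# Synthetic step text: 600 values per step, matching the file described above.
# The value is copied from the example XML.
SAMPLE = " ".join(["-1.42108547152020037E-13"] * 600)


def time_stages(text=SAMPLE, repeats=200):
    """Return the seconds spent on splitting alone and on float() alone."""
    tokens = text.split()
    split_time = timeit.timeit(lambda: text.split(), number=repeats)
    float_time = timeit.timeit(lambda: [float(t) for t in tokens], number=repeats)
    return {"split": split_time, "float": float_time}


if __name__ == "__main__":
    for stage, seconds in time_stages().items():
        print(f"{stage}: {seconds:.4f}s")
```

Swapping `float` for another converter in the second lambda gives a like-for-like comparison on the same tokens.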

Well-written question. Here's a Q/A related to string splitting performance: [Most efficient way to split strings in Python](https://stackoverflow.com/q/9602856/290085). Do you know for sure that it's `split()` and not the parsing of the strings into floats that's the bottleneck? - kjhughes, Mar 12 '21 at 15:25
Thanks. I'm sure that step 2 is the slowest part of the code. I don't think `split()` is underperforming. I'm just wondering if there is an even faster way to do what I'm trying to do, maybe even by changing approach completely. - guidout, Mar 12 '21 at 15:33
Right, when I asked about `split()` vs parsing strings into floats (as `float(f)` has to do), I was talking specifically about parts of step 2. - kjhughes, Mar 12 '21 at 15:38
Yes, you might be right, the float parsing might be the culprit. Surprisingly, the `if` at the end is slowing that line even more. I added a new line in the code above, and `np.fromstring` and `split` perform identically. - guidout, Mar 12 '21 at 15:42
You were officially right! It's the `float` parsing that's killing it. I just ran the code parsing as `int` (which doesn't work for me, of course) and it ran super fast. - guidout, Mar 12 '21 at 15:46
Good, you know where to next focus your efforts. Perhaps try [`fastnumbers`'s](https://pypi.org/project/fastnumbers/) `fast_float()`? - kjhughes, Mar 12 '21 at 15:58
Dammit, `fast_float` didn't work. It actually made it 50% slower. - guidout, Mar 12 '21 at 16:54

