Trying to optimize some code that reuses a matched group, I was wondering whether accessing Match.group() is expensive. I tried to dig in re.py's source, but the code was a bit cryptic.

A few tests seem to indicate that it is better to store the output of Match.group() in a variable, but I would like to understand what exactly happens when Match.group() is called, and whether there is a more direct way to access the content of the group.

Some example code to illustrate a potential use:

import re

m = re.search('X+', f'__{"X"*10000}__')

# do something
# m.group()

# do something else
# m.group()
Timings

direct access:

%%timeit
len(m.group())
# 220 ns ± 1.31 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

intermediate variable:

X = m.group()
%%timeit
len(X)
# 51 ns ± 0.172 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

References:
current re.py code (python 3.10)
current sre_compile.py code (python 3.10)

Removing the effect of the attribute lookup by binding the method to a name first (doesn't change much):

G = m.group

%%timeit
len(G())
# 230 ns ± 1.12 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
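For readers without IPython, the same comparison can be sketched with the standard-library timeit module. This is only an approximation of the %%timeit cells above: the names `cached`, `n`, `t_call` and `t_cached` are introduced here for illustration, and the lambda wrapper adds the same small call overhead to both measurements, so only the difference between the two numbers is meaningful.

```python
import re
import timeit

m = re.search(r'X+', f'__{"X" * 10000}__')

# Hypothetical name: the matched text stored once, as in the
# "intermediate variable" timing above.
cached = m.group()

n = 100_000  # number of repetitions; an arbitrary choice

# Call m.group() on every iteration.
t_call = timeit.timeit(lambda: len(m.group()), number=n)

# Reuse the stored string on every iteration.
t_cached = timeit.timeit(lambda: len(cached), number=n)

print(f'm.group() each time: {t_call / n * 1e9:.0f} ns per loop')
print(f'cached variable:     {t_cached / n * 1e9:.0f} ns per loop')
```

Absolute numbers will vary by machine and Python version.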
Sunderam Dubey
mozway
  • Are you sure this is directly related to `Match.group()` and not just a case of [Why is local variable access faster than class member access](https://stackoverflow.com/questions/12397984/why-is-local-variable-access-faster-than-class-member-access-in-python) ? – DeepSpace Jan 27 '22 at 11:10
  • Also, I didn't thoroughly go over `re`'s code, but if `group()` is a function/method then it makes perfect sense that calling it multiple times will have an overhead over just calling it once. A better test would be to save a reference to it (`x = match.group`) and then time the calls to `x()`. – DeepSpace Jan 27 '22 at 11:14
  • @DeepSpace nice suggestion, I tried it and it doesn't change the timing much – mozway Jan 27 '22 at 11:16

2 Answers

The match object holds a reference to the original string you searched in, and indexes where each group starts and ends, including group 0, the whole matched string. Every call to group() slices the original string to create a new string to return.

Saving the return value to a variable avoids the time and memory cost of having to slice the string every time. (It also avoids repeating the method call overhead.)

You can see that group() isn't just returning a cached string by the fact that the return value isn't always the same object:

>>> import re
>>> x = re.search(r'sd', 'asdf')
>>> x.group() is x.group()
False

If you want to see the implementation of group(), it's match_group in Modules/_sre.c in the Python source code.
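As a rough illustration of the behaviour described above, the following sketch reproduces what group() returns by slicing the searched string by hand, using the public Match.span() and Match.string attributes. It is not how the C implementation is written, only an equivalent at the Python level; the names `text`, `manual`, `same_text`, `cached` and `total` are made up for this example.

```python
import re

text = f'__{"X" * 10000}__'
m = re.search(r'X+', text)

# span() returns the (start, end) indexes of group 0 in the searched string.
start, end = m.span()

# Slicing the original string by hand gives the same text as m.group().
manual = m.string[start:end]
same_text = manual == m.group()

# Each call to group() slices again, so store the result once and reuse it.
cached = m.group()
total = len(cached) + len(cached)

print(start, end, same_text, total)
```

Storing the result in `cached` means the slice is made once, no matter how many times the text is used afterwards.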

Sunderam Dubey
user2357112
.group() can be used to access the whole match (when no arguments are given) or a specific group (when its number is given), for example:

import re
m = re.match('(X)X+','XXXXX')
print(m.group(1)) # output X

Note that re.Match instances also have a .string attribute, which gives you access to the full string that was searched, that is:

import re
m = re.match('(X)X+','XXXXX')
print(m.string) # output XXXXX
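
Combining .string with Match.span() gives a way to get at a group's text without calling .group(), though it still creates a new string through slicing. A minimal sketch, where `start`, `end` and `piece` are names introduced only for this example:

```python
import re

m = re.match(r'(X)X+', 'XXXXX')

# span(1) returns the (start, end) indexes of group 1 within m.string.
start, end = m.span(1)

# Slice the searched string directly instead of calling m.group(1).
piece = m.string[start:end]

print(piece)             # output X
print(piece == m.group(1))  # output True
```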
Daweo