What does pexpect pattern ".+" do?

Question

I have the following code:

test.py:

import pexpect
import sys

p = pexpect.spawn("ping 10.192.225.199", encoding="utf-8")
while True:
    try:
        index = p.expect([".+", pexpect.EOF, pexpect.TIMEOUT], timeout=1)
        if index == 0:
            print("===")
            print(p.after)
            print("===")
    except Exception as e:
        print(e)

Execution:

$ python3 test.py
===
PING 10.192.225.199 (10.192.225.199) 56(84) bytes of data.

===
===
64 bytes from 10.192.225.199: icmp_seq=1 ttl=63 time=0.607 ms

===
===
64 bytes from 10.192.225.199: icmp_seq=2 ttl=63 time=0.587 ms

===
...

It looks like .+ fetches a whole line of the ping output on each iteration.

But someone pointed me to this passage in the official documentation:

Beware of + and * at the end of patterns
Remember that any time you try to match a pattern that needs look-ahead that you will always get a minimal match (non greedy).
For example, the following will always return just one character:
child.expect ('.+')
This example will match successfully, but will always return no characters:
child.expect ('.*')

What does "will always return just one character" mean? Why do I get a full line in my minimal example? Shouldn't it be one character per loop?

BTW, it's extremely strange: if I change .+ to .*, then, just as the documentation says, it always returns no characters. So .* behaves as documented, but .+ does not ...

alexis · Accepted Answer · 2021-07-02T14:41:54.993

3

Edit: Changed the focus to address what you are really asking about, as clarified in your comments.

Beware of + and * at the end of patterns Remember that any time you try to match a pattern that needs look-ahead that you will always get a minimal match (non greedy). For example, the following will always return just one character: child.expect ('.+')

I don't know if this statement in the documentation was ever correct, but it does not match the current behavior. The regex .+ does not always match only one character, even when used by itself. The regex .* may match zero characters under some circumstances, but not always. The correct statement is: Both regexes will match whatever is in the read buffer, and no more. Why the difference in your observations? Since .+ must consume at least one character, it will trigger a read operation -- which fills the read buffer.

Explanation

Remember that pexpect was built for communication with an interactive process: Its input is not just sitting there waiting to be read, it is generated dynamically over time and in response to events. So pexpect will not try to read input unless it has a reason to. If the input buffer is empty and it comes to an expectation that can be satisfied with a zero-length string, it has no need to read further, and so it will not. More generally: If an expectation can be satisfied with what is already there, no more reads will be attempted.
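The rule above can be sketched with a toy model that uses only the standard library. This is not pexpect's actual implementation: the class ToyExpecter and its attributes are hypothetical names made up for illustration, and each item in chunks stands in for whatever a single read happens to return. re.DOTALL is used on the assumption that it mirrors how pexpect compiles string patterns.

```python
import re


class ToyExpecter:
    """Simplified model of an expect loop (hypothetical, not pexpect's code)."""

    def __init__(self, chunks):
        self.pending = list(chunks)  # data the child will produce, one item per read
        self.buffer = ""             # data read so far but not yet consumed
        self.after = None            # text matched by the last expect()
        self.reads = 0               # number of read operations performed

    def expect(self, pattern):
        regex = re.compile(pattern, re.DOTALL)
        while True:
            # First try to satisfy the expectation with what is already buffered.
            match = regex.search(self.buffer)
            if match:
                self.after = match.group(0)
                self.buffer = self.buffer[match.end():]
                return 0
            # Only when that fails is more input read.
            if not self.pending:
                raise EOFError("no more output")
            self.buffer += self.pending.pop(0)
            self.reads += 1


toy = ToyExpecter(["64 bytes from host\r\n", "64 bytes again\r\n"])

toy.expect(".*")
print(repr(toy.after), toy.reads)  # empty match, no read was needed

toy.expect(".+")
print(repr(toy.after), toy.reads)  # one read, and the whole chunk is matched
```

In this model, .* is satisfied by the empty buffer and never triggers a read, while .+ forces one read and then matches everything that read delivered.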

So here is an experiment you can try with your input source:

>>> p.expect('6')
0
>>> p.after
'6'
>>> p.expect(".*")
0
>>> p.after
'4 bytes from 142.250.184.238: icmp_seq=38 ttl=112 ...'

What happened here? The first expectation caused a line to be read in ("64 bytes from ..."), but only consumed the first character. Then the .* matched the rest.

You can get the same effect with a single expectation that consumes at least one character, e.g. 6.* or ..* etc. Like .+, these will cause input to be read and then consume the rest of the available input.

For comparison, try using true non-greedy regexes, .*? and .+?. These will always match zero or one characters, respectively, no matter where you use them.
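The difference between the greedy and the truly non-greedy quantifiers can be checked with the standard re module alone, without pexpect. In this minimal sketch, the string buffered is a made-up stand-in for data that has already been read, first_match is a hypothetical helper, and re.DOTALL is used on the assumption that it mirrors how pexpect compiles string patterns.

```python
import re

# Stand-in for data that is already sitting in the read buffer.
buffered = "64 bytes from 10.192.225.199: icmp_seq=1\r\n"


def first_match(pattern, text):
    """Return the text matched by `pattern` at the start of `text`."""
    return re.match(pattern, text, re.DOTALL).group(0)


greedy_star = first_match(".*", buffered)  # everything in the buffer
greedy_plus = first_match(".+", buffered)  # everything in the buffer
lazy_star = first_match(".*?", buffered)   # the empty string
lazy_plus = first_match(".+?", buffered)   # exactly one character

print(repr(lazy_star), repr(lazy_plus))
```

With data already available, the greedy forms take all of it, while the lazy forms stop at zero and one characters respectively.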

edited Jul 02 '21 at 14:41

answered Jul 02 '21 at 09:05


The behavior of `.*` is the same as the documentation says: non-greedy. But the behavior of `.+` in real code does not look like what the documentation says. Why? It outputs the whole line, not just one character; does it become greedy? I understand what `?+*` mean in re, I just can't understand their behavior in pexpect... – atline Jul 02 '21 at 09:09
`So there is no real difference between how .* and .+ behave`. Sorry, I really didn't catch your point. If I change `.+` to `.*` in the code above, `print(p.after)` is always empty, just as the documentation says: every `expect` returns no characters. But with `.+`, where the documentation says every `expect` returns just one character, `print(p.after)` in my code always returns the whole line. Sorry if I missed your point... – atline Jul 02 '21 at 09:41
The behavior is _not_ as the documentation says -- either it was always wrong or the behavior has changed in the meantime; but it is the same for both of these regexes. They are not greedy in the usual sense (they will not keep matching until no more characters can be added), but you will not always get 0 or 1 characters, either. – alexis Jul 02 '21 at 13:42
The documentation claims that it adds to the buffer one character at a time. – Barmar Jul 03 '21 at 21:48
@Barmar Do you have any more details about this? Thanks! – atline Jul 05 '21 at 06:45
@atline Sorry, I don't know anything other than what's in the doc. But the answer's explanation that it buffers as much as it can, rather than 1 character at a time, makes sense. – Barmar Jul 05 '21 at 12:49
What this means is that you can't depend on `.+` returning the longest possible match, it depends on how much it happens to read at a time. It *could* return just a single character. – Barmar Jul 05 '21 at 12:51
@atline, by all means you should try for more information. But the documentation clearly states that `expect(".+")` "will always return just one character", which is provably incorrect for the current version of `pexpect`. So I wouldn't take the rest of it too literally either; these are not "non-greedy" quantifiers in the regex sense, and also they are certainly not "lookahead" (as the doc calls them) in the regex meaning of this term. – alexis Jul 05 '21 at 13:16
PS. @atline, if you need a definite answer just look at [the source code](https://github.com/pexpect/pexpect). Or single-step your program in a debugger and see what happens. It's not that complex. – alexis Jul 05 '21 at 13:19
Thanks. Following your suggestion, I debugged into this, and I think you are probably correct. – atline Jul 10 '21 at 06:55

score 0 · Answer 2 · answered Jul 10 '21 at 06:40

Just as @alexis suggested, I attached a debugger to dig into the code.

First experiment (as in the next diagram): I set a breakpoint at index = p.expect([".+", pexpect.EOF, pexpect.TIMEOUT], timeout=1) and waited 5 seconds before stepping over it (to make sure ping -c 4 had finished, so that more output was available).

With this, I found that a single p.expect returned all of the output of ping -c 4. So in my initial example I got one line per p.expect only because, at that moment, pexpect's buffer did not yet hold more data.

Second experiment (as in the next diagram): I stepped into p.expect and found that it uses index = searcher.search(window, len(data)) to do the matching.

With a single p.expect, when the window holds 434 characters, the match for .+ also leaves spawn.after with 434 characters.

So I think, as the commenters said, the documentation is incorrect or outdated. .+ can certainly match more than one character in the buffer; the length of the match depends on how many characters are currently in the buffer, and also on the window size.
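One practical consequence is that the result of .+ depends on how the output happens to be split across reads, whereas a pattern with an explicit terminator does not. The sketch below illustrates this with a toy loop built on the standard re module. It is not pexpect's code: expect_all is a hypothetical helper, the sample output is made up, and the line pattern assumes lines end in \r\n, as is typical for output read from a pseudo-terminal.

```python
import re


def expect_all(pattern, chunks):
    """Toy expect loop: collect every match while feeding `chunks` one read at a time.

    Only meant for patterns that consume at least one character.
    """
    regex = re.compile(pattern, re.DOTALL)
    pending = list(chunks)
    buffer = ""
    matches = []
    while True:
        match = regex.search(buffer)
        if match:
            matches.append(match.group(0))
            buffer = buffer[match.end():]
            continue
        if not pending:
            return matches
        buffer += pending.pop(0)  # simulate one read from the child


LINE = r"[^\r\n]*\r\n"  # one complete line, terminator included

one_read = ["line one\r\nline two\r\n"]
two_reads = ["line one\r\nli", "ne two\r\n"]

print(expect_all(".+", one_read))    # a single match holding both lines
print(expect_all(".+", two_reads))   # matches follow the read boundaries
print(expect_all(LINE, one_read))    # one match per line
print(expect_all(LINE, two_reads))   # still one match per line
```

With .+ the matches mirror the read boundaries, while the terminated pattern yields the same two lines however the data arrives.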
