Email List: Xaustin-group-lX
[All Lists]

RE-ITERATE: a question about RE iteration and subpattern matching

To: Austin Group <yyyyyyyyyyyyyy@xxxxxxxxxxxxx>
Subject: RE-ITERATE: a question about RE iteration and subpattern matching
From: Mark Brown <yyyyy@xxxxxxxxxx>
Date: Fri, 5 Apr 2002 16:24:03 -0600
Cc: yyyyy@xxxxxxxxxxxxxx, "Shoji Sugiyama" <yyyyy@xxxxxxxxxx>, yyyy@xxxxxxx, Paul Eggert <yyyyyy@xxxxxxxxxxx>

[Some of the people in the GNU projects (Paul Eggert, Tom Lord,
Isamu Hasegawa, myself) were considering consolidating the various
regex implementations in GNU, and came up with a series of questions
concerning the Specification. This is one of those questions.]

RE-ITERATE: a question about RE iteration and subpattern matching

When the ERE /(R)*/ matches an entire string S, is the subpattern
/(R)/ required to match the longest possible suffix of S that is
consistent with an overall match?

This question follows up the earlier questions RE-CONCAT and RE-ASSOC;
please see the discussion of the earlier question for much of the
background behind this question.

This question is about the requirements given by the following quotes
from the regcomp description:

(Subexpression i begins at the ith matched open parenthesis,
counting from 1.)...

1. If subexpression i in a regular expression is not contained
within another subexpression, and it participated in the match
several times, then the byte offsets in pmatch[i] shall delimit
the last such match.

<http://www.opengroup.org/onlinepubs/007904975/functions/regcomp.html>:

Suppose, for example, we match the pattern /(a.*z|b.*y)*.*/ to the
string "azbazbyc", and want to know which substring was matched by the
subpattern /(a.*z|b.*y)/. Here are some possible interpretations:

(1) /(a.*z|b.*y)*/ matches the longest possible prefix "azbazby".
The last match for /(a.*z|b.*y)/ is therefore the longest suffix
of this prefix that is consistent with an overall match, namely
"bazby".

(2) First the subpattern /(a.*z|b.*y)/ is applied left to right
across the string, looking for matches that are consistent with
an overall match. This procedure matches "azbaz" and "by" in
turn. The last match for /(a.*z|b.*y)*/ is therefore "by".

(3) The standard is ambiguous, and both the above interpretations are
valid. Perhaps other interpretations are valid as well. A
conforming implementation might even use different
interpretations at different times.

Interpretation (2) needs some further explanation, since it relies on
a depth-first search procedure that may not be immediately obvious.
According to this procedure, to match the ERE /(A)*/ to the prefix of
a string S under the constraint of an overall match to S, apply the
following steps:

(i) Apply A to the prefix of S once, subject to the constraint
of overall match.

(ii) Suppose that A's match consumes the initial substring S1,
which must be the quasi-longest such substring.

(iii) If S1 is empty, succeed and report that A matched the empty string.

(iv) Otherwise, repeatedly apply A to the rest of S, each time
finding the quasi-longest initial prefix matching A under the
constraint of overall match. Repeat this process until the
next submatch would either fail or would be empty.

(v) Report the last match, which in this case must be nonempty.

The term "quasi-longest" is used above because (2)'s matching
procedure is greedy and in some cases doesn't find the longest match
for a subexpression. Instead, it finds the longest match that a
greedy matcher can find subject to the constraint of overall match.

Interpretations (1)-(3) are related to interpretations (1)-(3) of
question RE-CONCAT, and share many of their pros and cons. Here are
some more specific arguments for and against these interpretations:

pro (1):

The standard explicitly says that the subpattern must match the
"longest possible string"
<http://www.opengroup.org/onlinepubs/007904975/basedefs/xbd_chap09.html#tag_09_01_02>
.
"bazby" is the longest possible string here, so it must be the match.

pro (2):

The phrase "left to right" implies that the matcher must apply the
subpattern several times, left to right, greedily but consistently
with the overall match being the longest.

con (2):

Even more so than with RE-CONCAT, interpretation (2) is quite
complicated. It is unlikely that the POSIX authors intended (2),
since (1) is a much more straightforward interpretation.

---
Mark S. Brown
Senior Technical Staff Member, IBM Server Group
512.838.3926 fax 512.838.3882
yyyyy@xxxxxxxxxx

<Prev in Thread] Current Thread [Next in Thread>
  • RE-ITERATE: a question about RE iteration and subpattern matching, Mark Brown <=