Email List: Xaustin-group-lX
[All Lists]

RE-ASSOC: a question about the associativity of RE concatenation

To: Austin Group <yyyyyyyyyyyyyy@xxxxxxxxxxxxx>
Subject: RE-ASSOC: a question about the associativity of RE concatenation
From: Mark Brown <yyyyy@xxxxxxxxxx>
Date: Fri, 5 Apr 2002 15:37:31 -0600
Cc: yyyyy@xxxxxxxxxxxxxx, "Shoji Sugiyama" <yyyyy@xxxxxxxxxx>, yyyy@xxxxxxx, Paul Eggert <yyyyyy@xxxxxxxxxxx>
RE-ASSOC: a question about the associativity of RE concatenation

[Some of the people in the GNU projects (Paul Eggert, Tom Lord,
Isamu Hasegawa, myself) were considering consolidating the various
regex implementations in GNU, and came up with a series of questions
concerning the Specification. This is one of those questions.]

Must pattern matching strictly follow the left-associative POSIX
grammar for regular expressions, or is pattern matching allowed to
behave as if the grammar had been right-associative?

This question follows up the earlier question RE-CONCAT; please see
the discussion of the earlier question for much of the background
behind this question.

The POSIX grammar for EREs contains this production:

ERE_branch : ERE_expression
| ERE_branch ERE_expression

<http://www.opengroup.org/onlinepubs/007904975/basedefs/xbd_chap09.html#tag_09_05_03>

This grammar is left-associative, which means that the ERE /ABC/ is
parsed as if it were /(AB)C/, aside from parenthesis numbering.

In the REs of classical formal language theory developed by Stephen
Kleene and others, the concatenation of REs is associative, in the
sense that the REs /(AB)C/ and /A(BC)/ describe the same regular sets.
But POSIX REs might have different possible interpretations depending
on how one associates subpatterns to matched strings.

For example, suppose we match the ERE /(week|wee)(night|knights)(s*)/
to the string "weeknights", and want to know which subpatterns match
which substrings. Here are some possible interpretations:

(1a) The "subpatterns" are the two immediate subexpressions,
interpreted according to the grammar, namely
/(week|wee)(night|knights)/ and /(s*). The longest consistent
match for /(week|wee)(night|knights)/ is "weeknights".
Therefore, /(s*)/ matches the empty string.

(1b) Even though the grammar is left-associative, the matching
semantics are right-associative. The "subpatterns" are the two
subexpressions /(week|wee)/ and /(night|knights)(s*)/. The
longest consistent match for /(week|wee)/ is "week". We then
apply the rule recursively, and find that the longest consistent
match for /(night|knights)/ against "nights" is "night".
Therefore, /(s*)/ match "s".

(2) The "subpatterns" are all the subexpressions, not just the
concatenated subexpressions (see interpretation (2) of question
RE-CONCAT), so the matching semantics are inherently
right-associative. This returns the same match as (1b), but for
the slightly different ERE /(wee|week)(night|knights)(s*)/, (1b)
still says /(s*)/ matches "s" but (2) says /(s*)/ matches the
empty string.

(3) The standard is ambiguous, and some or all the above
interpretations are valid. Perhaps other interpretations are
valid as well. A conforming implementation might even use
different interpretations at different times.

(1a) and (1b) are both consistent with interpretation (1) of question
RE-CONCAT. Note that a user can override either (1a) or (1b) by using
explicit parenthesization. (2) is consistent with RE-CONCAT
interpretations (2); under (2), explicit parenthesization is
irrelevant.

Here are some arguments for and against these interpretations. These
lengthy arguments are briefly summarized at the end of this section.

pro (1a):

The most natural interpretation of the standard and rationale is
that the implementation must recurse through the regular
expression's parse tree, as indicated by the grammar.

The ERE must be parsed strictly according to the grammar, because
the standard says that the ERE "grammar takes precedence over the text".
<http://www.opengroup.org/onlinepubs/007904975/basedefs/xbd_chap09.html#tag_09_05>

con (1a):

(1a) is counterintuitive, because it says that when matching
/XYZ/, it is more important to maximize the length of the match
for /XY/ than it is to maximize the length of the match for /X/.
This contradicts the left-to-right nature of the subpattern rule.

Had the POSIX authors intended (1a), they would have written the
grammar to be right-associative, to avoid this counterintuitive
behavior of (1a).

pro (1a):

Had the POSIX authors intended that it be more important to
maximize the length of /X/ than to maximize the length of /XY/
when matching /XYZ/, they could easily done so by using a
right-associative grammar for ERE_branch. However, they used a
left-associative grammar, indicating that (1a) was intended over
(1b).

This is reminiscent of the situation in the C programming
language. Classical addition is associative, but Standard C
requires that X+Y+Z must be evaluated according to C's
left-associative grammar, and must therefore be evaluated as
(X+Y)+Z and not as X+(Y+Z). The distinction matters, for example,
if floating-point arithmetic is used, due to rounding errors.
Similarly, classical RE concatenation is associative, but POSIX RE
concatenation is not, due to the side effects of subpattern
matching, so it is important to follow the associativity rules
specified by the grammar.

con (1a):

The C Standard explicitly says that X+Y+Z must be evaluated as
(X+Y)+Z, whereas the POSIX Standard does not explicitly say how to
follow the RE grammar when determining matched subpatterns.
Admittedly the X+Y+Z statement in the C Standard is a
non-normative footnote, but the POSIX authors could also have
added a footnote if they had wanted to make it clear that (1a) was
intended.

pro (1b) and (2):

The semantics should not be contorted by the particular method used
to define the syntax. The standard says:

Portions of this volume of IEEE Std 1003.1-2001 are expressed in terms
of a special grammar notation. It is used to portray the complex
syntax of certain program input. The grammar is based on the syntax
used by the yacc utility. However, it does not represent fully
functional yacc input, suitable for program use; the lexical
processing and all semantic requirements are described only in textual
form. The grammar is not based on source used in any traditional
implementation and has not been tested with the semantic code that
would normally be required to accompany it.
<http://www.opengroup.org/onlinepubs/007904975/utilities/xcu_chap01.html#tag_01_10>

and this indicates that the grammar should not be taken overly
seriously when interpreting the semantics.

pro (1a):

Some POSIX matchers use (1a). These include the Rx matcher and the
Spencer-derived matchers used by Tcl and others.

See also the arguments pro and con RE-CONCAT's interpretations
(1), (2), and (3).

The bottom line of this discussion:

This question is probably moot unless RE-CONCAT interpretation (1) is
chosen, in which case the realistic choices are (1a) or (1b).

(1a) follows the POSIX standard more strictly, but (1b) typically
matches user intent better. The differences between (1a) and (1b)
rarely arise in practice, so it wouldn't affect applications much if
either interpretation were chosen. A user can override either (1a)
or (1b) by using explicit parenthesization.

Suggestion:

Modify the POSIX grammar so that RE concatenation is
right-associative. This would cause (1a) to have the semantics of
(1b), thus unifying the two interpretations and removing the
confusion.



---
Mark S. Brown
Senior Technical Staff Member, IBM Server Group
512.838.3926 fax 512.838.3882
yyyyy@xxxxxxxxxx

<Prev in Thread] Current Thread [Next in Thread>