Email List: Xaustin-group-lX
[All Lists]

RE_CONCAT: question about RE concatenation and subpattern matching

To: Austin Group <yyyyyyyyyyyyyy@xxxxxxxxxxxxx>
Subject: RE_CONCAT: question about RE concatenation and subpattern matching
From: Mark Brown <yyyyy@xxxxxxxxxx>
Date: Fri, 5 Apr 2002 13:23:51 -0600
Cc: Isamu Hasegawa <yyyyy@xxxxxxxxxxxxxx>, yyyy@xxxxxxx, Paul Eggert <yyyyyy@xxxxxxxxxxx>
RE-CONCAT: a question about RE concatenation and subpattern matching

[Some of the people in the GNU projects (Paul Eggert, Tom Lord,
Isamu Hasegawa, myself) were considering consolidating the various
regex implementations in GNU, and came up with a series of questions
concerning the Specification. This is one of those questions.]

When the regular expression /XY/ matches an entire string S, must the
subpattern /X/ match the longest prefix of S that is consistent with
the overall match?

This is a question of interpretation about the following sentence in
the POSIX standard:

Consistent with the whole match being the longest of the leftmost
matches, each subpattern, from left to right, shall match the
longest possible string.

<http://www.opengroup.org/onlinepubs/007904975/basedefs/xbd_chap09.html#tag_09_01_02>

For reference, the POSIX rationale goes on to say:

It is possible to determine what strings correspond to
subexpressions by recursively applying the leftmost longest rule to
each subexpression, but only with the proviso that the overall
match is leftmost longest.

<http://www.opengroup.org/onlinepubs/007904975/xrat/xbd_chap09.html#tag_01_09_01>

Here is an example to illustrate the question. Suppose we match the
ERE /((week|wee)(night|knights))(s*)/ to the entire string
"weeknights", and want to know which subpatterns match which
substrings. Here are some possible interpretations:

(1) The "subpatterns" are the two immediate subexpressions. The
longest consistent match for /((week|wee)(night|knights))/ is
"weeknights". Therefore, /(s*)/ matches the empty string.

(2) The "subpatterns" are the (not necessarily immediate)
subexpressions, regardless of whether they contribute directly to
the concatenation. One subexpression is considered to be left of
another subexpression if it occurs before the other in the
postorder traversal of the ERE's parse tree, when the ERE is
parsed according to the POSIX grammar for EREs. In this
left-to-right order, the subexpressions are:

/week/
/wee/
/week|wee/
/(week|wee)/
/night/
/knights/
/night|knights/
/(night|knights)/
/(week|wee)(night|knights)/
/((week|wee)(night|knights))/
/s/
/s*/
/(s*)/
/((week|wee)(night|knights))(s*)/

(There are actually more subexpressions -- for example, /night/
has eight subexpressions /n/, /i/, /ni/, /g/, /nig/, /h/, /nigh/,
/t/ -- -- but these extra subexpressions do not contribute to the
analysis and are omitted for brevity.)

We apply the longest-match rule to each of these subexpressions
in order. The longest consistent match for /week/ is "week", so
/week|wee/ and /(week|wee)/ match "week". The longest
consistent match for /night/ against the trailing suffix "nights"
is "night", so /night|knights/ and /(night|knights)/
both match "night". Hence /(week|wee)(night|knights)/ and
/((week|wee)(night|knights))/ both match "weeknight". Finally,
/s/, /s*/, and /(s*)/ match "s".

(3) The standard is ambiguous, and both the above interpretations are
valid. Perhaps other interpretations are valid as well. A
conforming implementation might even use different
interpretations at different times.

Here are some arguments for and against these interpretations. These
lengthy arguments are briefly summarized at the end of this section.

pro (1):

(1) is simple and straightforward.

con (2):

(2) is counterintuitive, because it causes subexpressions to not
find a longest consistent match, despite the longest-match rule.
For example, under (2), when matching /(.|..).*/ against "ab",
most people would expect /(.|..)/ to match "ab" because it is a
longer match, but (2) requires it to match the shorter substring
"a". Conversely, most people would expect /(.|..)/ and /(.{1,2})/
to be equivalent, but (2) says that when matching /(.{1,2}).*/
against "ab", /(.{1,2})/ matches "ab" instead of the "a" that
/(.|..)/ would match in its place.

Here is another example. (2) causes the interpretation of a
regular expression X to be changed merely because X is followed by
/.*/. This contradicts the left-to-right nature of the subpattern
rule. For example, under (2), when matching /(.|..)/ against
"ab", /(.|..)/ matches the whole string; but when matching
/(.|..).*/ against "ab", /(.|..)/ matches only "a" even though it
would be consistent for it to match the same "ab" as before.

In general, submatching must be context-dependent, but it is
counterintuitive for /.*/ to provide context from the right in a
matcher that is supposed to be left-to-right. This
counterintuitive behavior does not occur with (1), nor does it
occur with a purely greedy NFA-style matcher that does not conform
to POSIX because it does not always find the longest match; but it
does occur with (2).

A more general statement of this point is that '|' is not
commutative. For example, /(week|wee)(night|knights)(s*)/ differs
from /(wee|week)(night|knights)(s*)/ when matching against
"weeknights". The former pattern matches /(s*)/ to "s", whereas
the latter one matches /(s*)/ to the empty string. This surprises
some users and is an annoyance for programs that construct regular
expressions.

pro (2):

Many people want the left subexpression of '|' to have priority.

con (2):

POSIX explicitly rejected the expectation that the left
subexpression of '|' should have priority, by ruling that it's
more important to have the longest match than to have the left
subexpression match. The left subexpression does not have
priority in /(.|..)/, for example.

pro (2):

If you want the longest match, then in almost all cases you can
reorder the alternatives of '|' so that the leftmost subexpression
matches the longest substring. With (2) you have the option of
preferring a shorter match for a '|' subexpression, if that is
what you want. (1) does not give you that option.

con (2):

But under (2) you can't give an alternative priority at the top
level, unless the alternatives happen to be the same length.

It is odd for (2) to have one rule at the top level, but a
different rule for subexpressions. It would be more consistent if
the same rule applied at all levels. It is particularly odd that
the standard doesn't mention this discrepancy; it indicates that
the discrepancy was not intended.

con (2):

It's implausible that (2) was intended, given that the standard
does not mention issues like postorder traversal which are
inherent to (2). Had the POSIX authors intended the
counterintuitive behaviors inherent to (2), they would have
mentioned them.

pro (2):

In effect, (1) specifies a preorder traversal, and (2) specifies a
postorder traversal. The standard does not specify either
preorder or postorder, so neither is more plausible than the other
and postorder should be allowed.

con (2):

The POSIX authors were clearly thinking about DFA-based
implementations. This is why
<http://www.opengroup.org/onlinepubs/007904975/basedefs/xbd_chap09.html#tag_09_01_02>
gives the example where /(wee|week)(knights|night)/ matches all of
"weeknights". Here, the authors are making the point that a
DFA-based matcher must be run while a further match is possible,
and that one cannot stop the DFA on the first match.

Had the POSIX authors wanted to allow (2), they would have been
inclined to use an example that distinguishes between a greedy
NFA-style matcher and (2), because a greedy NFA-style matcher is
incorrect but is superficially similar to (2). For example, they
could have used the example of matching
/(week|wee)(night|knights)/ against "weeknights", where a greedy
NFA-style matcher will match only "weeknight" instead of the
"weeknights" required by (2).

The fact that the POSIX authors did not give such an example
indicates that they were not considering (2) as a possible
interpretation.

pro (1):

(1) takes less space to implement than (2) does. If P is the
pattern length and S is the string length, (1) requires only O(P)
space, but (2) requires O(S) space and the constant multiple of S
is relatively large in practice. (2) needs this space to record
the states of the DFA match, so that it can then run an NFA
forward across the same string and use the results of the DFA
match as an oracle at each nondeterministic branch. (2) can be
done in less than O(S) space, but only with an exponential time
overhead that would be unacceptable in practice.

pro (2):

There is little difference in performance in practice if you
optimize (2) carefully enough, because the worst cases occur only
in pathological examples.

con (2):

Optimizing (2) is tricky, and this trickiness is a maintenance
burden that it would be better to avoid.

pro (2):

Many existing POSIX matchers seem to use (2), including Solaris 8
and the GNU C library 2.2.

con (2):

However, these implementations predate POSIX.2-1992, and though
they were modified to conform to POSIX, quite possibly this
particular issue hasn't been tested in the available test suites,
or perhaps the test suites themselves are not correct. Since
POSIX.2-1992 explicitly rejected the common existing practice of
greedy NFA-style matching, one shouldn't rely too heavily on the
perhaps-unmodified continuation of that existing practice.

con (2):

The BSD matcher uses (2), but it is not a POSIX matcher because it
does not consistently find the longest overall match. Under BSD,
matching the BRE /\(\(abc\)\{0,2\}\(abc\)\{0,2\}\)\3/ against the
string "abcabc" incorrectly yields only "abc" in both FreeBSD 3.0
and OpenBSD 3.0. This suggests that (2) is correlated with the
pre-POSIX practice that POSIX.2-1992 rejected.

pro (1):

Some POSIX matchers use (1). These include the Rx matcher and the
Spencer-derived matchers used by Tcl and others.

pro (3):

Insisting on either (1) or (2) would invalidate some existing
implementations.

con (3):

Insisting on either (1) or (2) would invalidate few if any
existing applications, and applications are more important than
implementations.

The POSIX authors explicitly chose DFA-style semantics for REs
because they are "easier to define and describe" than other
semantics. They also wrote:

It is thought that dependencies on the choice of rule are rare;
carefully contrived examples are needed to demonstrate the
difference.

(Both these quotes are taken from
<http://www.opengroup.org/onlinepubs/007904975/xrat/xbd_chap09.html#tag_01_09_02>
.)

This suggests that the POSIX authors desired a simple,
deterministic semantics for regular expressions, and thought that
most users wouldn't care so much which semantics were chosen, so
long as the behavior was standardized. It is therefore unlikely
that the POSIX authors intended the complicated, ambiguous
semantics that would result if multiple interpretations were
valid.

The bottom line of this discussion:

(1) is simpler and more intuitive, and it better fits the intent of
POSIX's overall longest-match rule. (2) is more popular in
practice. The differences between (1) and (2) rarely arise in
practice, so it wouldn't affect applications much if either
interpretation were chosen.


---
Mark S. Brown
Senior Technical Staff Member, IBM Server Group
512.838.3926 fax 512.838.3882
yyyyy@xxxxxxxxxx

<Prev in Thread] Current Thread [Next in Thread>