Andrew Hume writes:
i think this is true. doug mcilroy (retired, but at dartmouth) wrote a
posix RE parser from scratch from teh spec and i think had some nice
test cases.
I'd also like to mention to the list my own test suite, which covers
many of the issues raised in these questions. It can be found
in "the hackerlab C library" (http://www.regexps.com/hackerlab.html).
A posix-specific version of the test suite can be (trivially) built
and run independently of the rest of the library. See the files:
src/hackerlab/tests/rx-posix-tests/=generic-test-regex.c
src/hackerlab/tests/rx-posix-tests/posix-test-cases.h
My answers in the test suite presume a recursive analysis of
subexpressions, based on a presumption of left associativity (as in
the grammar).
>Suggestion:
>
> Modify the POSIX grammar so that RE concatenation is
> right-associative. This would cause (1a) to have the semantics of
> (1b), thus unifying the two interpretations and removing the
> confusion.
>
i concur. when we were discussing this at the time, for the expression
(a)(b)(c), we intended that the submatch rule apply to (a),
(b), and (c) in that order. again, we explicitly considered
that some REs might change their meaning but anyone that close
to the edge was asking for trouble anyway.
Sigh. So you favor the right associative interpretation. That has
the unfortunate side effect that adding subexpressiosn to the right of
the pattern can change what is matched by earlier subexpressions, even
in cases where the overall match does not change.
I favor the left-associative interpretation because:
1. It's what the grammar says.
2. Perhaps it's a cultural bias, but given a regexp to
parse "by eye", I find left-associativity natural.
3. It has the property that additional subexpressions
to the right, which don't extend the match, have
no effect on what is matched by the subexpressions
to the left.
I find these comments incredibly refreshing:
there was a plain sense that we should define meaning
independent of what any implementation technique gave you
easily or for free.
----
we certainly did not intend ambiguous or complicated semantics.
that is why we chose the simple rule.
----
recall that a RE generates a (possibly infinite) set of strings. when
we say that an RE "matches" a string, we mean that that string is a
member of the set of strings
That last point is especially important. The rules for constructing
BREs and EREs in the spec say only what the set of strings is that the
patterns can match. The leftmost-longest rule is then a separate part
of the spec that tells how to determine subexpression positions.
That's a really clear-thinking way to look at the issues.
-t
|