Keld:
    getting the results as per your example 1b
"getting the results as per" could mean any of (1b), (2), or (3).
It is a mistake to look at this on a case-by-case basis, saying which
answers for particular regexps and strings look better to you --
rather, you must compare the general rules.
You say you favor the answer you do because, as the question
presumes-without-substantiation, it better matches "user intent".
Indeed, that particular _result_ may match typical user expectation --
but do any of the four _general_rules_? I believe it is likely that
each of the four proposed interpretations of the rule will surprise
users in some cases -- no fully worked-out regexp semantics is likely
to be free of surprises. For example:
    1a  will treat (A)((B)(C)) differently from (A)(B)(C).
    1b  will treat ((A)(B))(C) differently from (A)(B)(C).
        Moreover, it means that the grammar in the spec is
        written incorrectly. It also means that (A).*
        will produce a different result for some values of A
        even when the length of the overall match does not change.
    2   will treat (A|B)C differently from (B|A)C.
    3   will thwart users wanting to write regexps that are
        portable across conforming implementations.
We can imagine other interpretations, too, though these get even
farther away from the language of the standard.
You propose an incomplete operational model for matching:
    For the 3 components mentioned then the first (leftmost)
    component finds "week" and the 2nd component then needs to
    work on the remainder of the substring.
"needs to work on the remainder of the substring" suggests that you
are thinking in terms of a particular strategy for getting the right
answer by imagining the matcher's algorithm in action. That's the
spirit of Perl regexps, not Posix. The ECMAScript (JavaScript) standard
gives such an operational semantics (for Perl-like regexps).
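
To make "operational" concrete, here is a small illustration using Python's re module, which is a backtracking, Perl-style matcher. It is only an analogy: I am not claiming that Python implements any of the four interpretations under discussion, and the regexps and the string are made up for the example. The answer falls out of the order in which the matcher tries alternatives, so swapping the branches of an alternation changes the result, much like the (A|B)C versus (B|A)C case above.

```python
import re

text = "abcd"

# The matcher tries the branches of each alternation left to right and
# keeps the first combination that lets the whole pattern succeed.
m1 = re.match(r"(a|ab)(c|bcd)", text)
m2 = re.match(r"(ab|a)(c|bcd)", text)

# Branch "a" succeeds first, so group 2 has to take "bcd".
print(m1.group(0), m1.groups())   # abcd ('a', 'bcd')

# Branch "ab" succeeds first, so group 2 takes "c" and the match stops.
print(m2.group(0), m2.groups())   # abc ('ab', 'c')
```

Note that even the overall match differs between the two spellings. Under a rule that first demands the longest overall match, both would have to report "abcd".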
The Posix standard, on the other hand, gives a static ranking of
possible substring position mappings for a given regexp and a string.
It's not based explicitly on a particular algorithm -- it's based on a
static property of the lengths of substring positions (E.2.8.2).
One can tell by immediate inspection which of two proposed answers is
better. Of the proposed rules, (1a) and (1b) alone seem to reflect
the static nature of the Posix rule. Of those two, only (1a)
accurately reflects the grammar in the spec.
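
A rough sketch of what "static" means here, in Python. The ranking key below is a hypothetical rule chosen only for illustration (earliest start, then longest, applied to the overall match and then to each subexpression in order). It is not the text of the standard, and it is not meant to be (1a) or (1b). The names Span, rank_key and better are invented for the example. The point is that two candidate answers are compared by looking only at their positions, with no reference to how a matcher might have found them.

```python
from typing import List, Optional, Tuple

# (start, end) offsets; None means the subexpression did not participate.
Span = Optional[Tuple[int, int]]


def rank_key(spans: List[Span]) -> tuple:
    """Sort key for a candidate answer; a smaller key is preferred.

    spans[0] is the overall match and spans[i] is the i-th parenthesized
    subexpression.  ASSUMED rule, for illustration only: for each span in
    order, prefer the earlier start, then the longer length.
    """
    key = []
    for span in spans:
        if span is None:
            key.append((float("inf"), 0))
        else:
            start, end = span
            key.append((start, -(end - start)))
    return tuple(key)


def better(a: List[Span], b: List[Span]) -> List[Span]:
    """Return the preferred candidate by inspection of positions alone."""
    return a if rank_key(a) <= rank_key(b) else b


# Two candidate answers for (a|ab)(c|bcd) against "abcd".
cand_long = [(0, 4), (0, 1), (1, 4)]    # "a" + "bcd"
cand_short = [(0, 3), (0, 2), (2, 3)]   # "ab" + "c"
print(better(cand_long, cand_short))    # the longer overall match wins

# Three candidates for (a*)(a*) against "aa", all with the same overall match.
cands = [[(0, 2), (0, 1), (1, 2)], [(0, 2), (0, 2), (2, 2)], [(0, 2), (0, 0), (0, 2)]]
print(min(cands, key=rank_key))         # first subexpression as long as possible
```

Swapping in a different key gives a different interpretation, but the comparison stays a matter of inspecting two finished answers.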
It's also interesting to consider what impacts the various
interpretations have on implementations. Alas, the question posers
did not have the time to properly address those issues in the
question.
-t