One of the most important criteria for this choice is:
what do existing implementations do?
So far nobody has produced such data.
My general impression, having casually investigated many
implementations, though, alas, not in any systematic way, is that
"historic practice" is essentially random. While simple,
"unambiguous" regexps come out the same on all systems, there is
effectively no commonality among implementations beyond that. Again,
this is just my impression, and I can't quote you from my notebook to
back that up or anything equally useful.
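
For anyone who wants to collect that kind of data, a probe along these lines could be a starting point. This is only a sketch: the helper name probe_regex and the sample pattern in the note below are my own assumptions and do not come from any test suite. It relies solely on the standard regcomp, regexec, regerror and regfree calls, and it prints whatever offsets the local implementation reports rather than asserting which answer is right.

```c
#include <regex.h>
#include <stdio.h>

/* Hypothetical helper (not part of any library): compile `pattern` as a
 * POSIX extended regexp, match it against `subject`, print each offset
 * pair the implementation reports, and leave the pairs in `out`.
 * Returns 0 on a match, otherwise the nonzero regcomp/regexec code. */
int probe_regex(const char *pattern, const char *subject,
                regmatch_t *out, size_t nmatch)
{
    regex_t re;
    int rc = regcomp(&re, pattern, REG_EXTENDED);
    if (rc != 0) {
        char msg[128];
        regerror(rc, &re, msg, sizeof msg);
        fprintf(stderr, "regcomp: %s\n", msg);
        return rc;
    }

    rc = regexec(&re, subject, nmatch, out, 0);
    if (rc == 0) {
        /* Entry 0 is the overall match; entries 1.. are the
         * parenthesized subexpressions (-1 means "did not participate"). */
        for (size_t i = 0; i < nmatch; i++)
            printf("  \\%zu: [%ld, %ld)\n", i,
                   (long)out[i].rm_so, (long)out[i].rm_eo);
    }

    regfree(&re);
    return rc;
}
```

Calling probe_regex("(a|ab)(c|bcd)(d*)", "abcd", m, 4) with a four-element regmatch_t array m is one possible probe. The overall match has to span the whole string however the alternatives are chosen, but the string can be divided among the three subexpressions in more than one way, so the submatch offsets are where implementations can be expected to differ.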
I firmly believe that the original spec authors intended to fix that
situation by giving a simple, unambiguous rule -- and one that opened
the door to quite sophisticated implementations. I believe they
intended (1a), and that they achieved their goals, perhaps without
fully understanding all of the details. Perhaps one of them will
speak up and embarrass me by stating otherwise.
Beyond historic practice, you ought to look at the effects various
interpretations have upon _potential_ implementations. The trouble
there is, this is a poorly documented area. I can count on one hand
the people I've encountered who are qualified by experience to
think through the issues. I know of no papers or proofs that would
let any of us make definitive statements of any useful variety.
It's interesting to note that, for example, BSD has gotten by for
years with an implementation that does not even reliably find the
longest possible match. I think users of libc regexps are,
traditionally, quite undemanding.
It's also interesting to note that, in my experience, once you have a
rigorous and sophisticated implementation, the potential applications
for regexps grow considerably. Living up to a strict interpretation
of the spec pays off in the future.
Finally, it is interesting to note that a huge number of open source
and Free Software packages, particularly but not exclusively the GNU
software, have given up on native regexec implementations: they
include their own regexp engines, for performance reasons or feature
reasons.
Given that, I think:
  1) The standard has a strict interpretation which I am
     personally certain is (1a). That interpretation should be
     reaffirmed in the answer(s) to the question(s) at hand.
  2) The authors of conformance test suites ought to be advised
     to be forgiving of existing implementations, but demanding
     of future implementations. One shouldn't suddenly declare
     Solaris or AIX non-conforming on the basis of regexps that
     hardly anyone uses, but one should attempt to ramp up to a
     situation in which regexps are used more heavily.
For an existence proof that implementations of (1a) are possible and
have quite useful performance characteristics, I'll point to rx-posix,
part of "the hackerlab C library", distributed from www.regexps.com.
(That site has been down for a couple of weeks -- it will be back
on-line next week.)
For a weaker existence proof, I'll point to Spencer's matcher as it
occurs in Tcl. That implementation isn't quite (1a). It is closer to
(1b), though when I last checked (again, informally) it had some bugs
even under that interpretation. Nevertheless, I believe that (1a) was
intended -- rx-posix arrived at (1a) after Henry described the basic
approach to me in email.
-t