In message <yyyyyyyyyyyyyyyyyyyyyyyyy@xxxxxxxxxxxxxxx>, Paul Eggert writes:
> Suppose we match the BRE /\(a\{2,3\}\)\1*/ against the string
> "aaaa". POSIX requires the longest match, which is "aaaa". And yet
> Solaris 8 says the longest match is "aaa". And OpenBSD 3.0 says the
> longest match is "aa". Glibc 2.2 gets it right.
>
> Are these incorrect answers the result of conscious implementation
> tradeoffs? No. They're bugs.
As I revisit POSIX, my reading is that it requires the leftmost longest
match, which makes "aaa" the correct result and the glibc behavior
unexpected. The key paragraph in the standard is (I'm actually taking this
from Chapter 9 of the Single Unix Specification, so it may not be what was
originally in POSIX):
    Consistent with the whole match being the longest of the leftmost matches,
    each subpattern, from left to right, shall match the longest possible
    string. For this purpose, a null string shall be considered to be longer
    than no match at all. For example, matching the BRE "\(.*\).*" against
    "abcdef", the subexpression "(\1)" is "abcdef", and matching the BRE
    "\(a*\)*" against "bc", the subexpression "(\1)" is the null string.
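To make the two candidate readings concrete, here is a minimal sketch in Python that enumerates by brute force every way \(a\{2,3\}\)\1* can match at the start of "aaaa". It does not use any regex library and only handles this one pattern; the helper name `candidates` and the two selection rules are my own illustration, not wording from the standard.

```python
def candidates(text, lo=2, hi=3):
    """Yield (whole_match, group1) for every way \\(a\\{lo,hi\\}\\)\\1*
    can match at offset 0 of text. Hypothetical helper for illustration."""
    for n in range(lo, hi + 1):
        group = "a" * n
        if not text.startswith(group):
            continue
        end = n
        # \1* may repeat the captured text zero times...
        yield text[:end], group
        # ...or as many more times as the input allows.
        while text.startswith(group, end):
            end += n
            yield text[:end], group


results = list(candidates("aaaa"))

# Reading 1: take the longest whole match, whatever \1 has to be.
globally_longest = max(results, key=lambda r: len(r[0]))

# Reading 2: make \1 as long as possible first, then extend the rest.
subpattern_first = max(results, key=lambda r: (len(r[1]), len(r[0])))

print(results)
print(globally_longest)
print(subpattern_first)
```

Under these assumptions the first rule picks "aaaa" (with \1 = "aa") and the second picks "aaa" (with \1 = "aaa"), which is exactly the disagreement at issue.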
Now, I'm no expert in the intended behavior of POSIX regular expressions,
but I am rather intimate with the intended behavior of Perl regular
expressions (also leftmost longest), having implemented the Apache
Jakarta ORO regular expression package. The issue that causes perhaps the
most confusion among users is leftmost longest matching.
For example, against the string "aaaa", \(a\{2,3\}\)\1* will match "aaa"
but \(a\{2,3\}\)\1 will match "aaaa". So any additional clarity that can be
added to the wording in the standard would be helpful. If the intent is for
the globally longest match to result, then the wording needs to be reworked,
because the current wording and accompanying examples indicate that "aaaa"
is not the intended match for your sample expression.
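The behavior described above can be reproduced with a backtracking engine. The sketch below uses Python's standard `re` module as a stand-in for a Perl-style matcher (an assumption on my part: it is neither Jakarta ORO nor a POSIX implementation), with the BRE syntax translated to Python's, so \(a\{2,3\}\) becomes (a{2,3}).

```python
import re

text = "aaaa"

# With the trailing star: the group greedily takes "aaa", and \1* is
# satisfied by zero repetitions, so the engine never backtracks.
with_star = re.match(r"(a{2,3})\1*", text)

# Without the star: \1 must match once, so the engine backtracks the
# group to "aa" in order to find a second "aa".
without_star = re.match(r"(a{2,3})\1", text)

print(with_star.group(0), with_star.group(1))
print(without_star.group(0), without_star.group(1))
```

The first line printed is "aaa aaa" and the second is "aaaa aa": making the back-reference optional yields the shorter overall match.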
daniel