Email List: Xaustin-group-lX
[All Lists]

Re: RE-ASSOC: a question about the associativity of RE concatenation

To: yyyyyyyyyyyyyy@xxxxxxxxxxxxx
Subject: Re: RE-ASSOC: a question about the associativity of RE concatenation
From: Paul Eggert <yyyyyy@xxxxxxxxxxx>
Date: Sat, 6 Apr 2002 22:27:09 -0800 (PST)
Cc: yyyyy@xxxxxxxxxxxxxx, yyyyy@xxxxxxxxxx, yyyy@xxxxxxx
References: <FE465D8F724E3F4F811D067203A214AE045CE0F4@red-msg-08.redmond.corp.microsoft.com>
> Date: Fri, 5 Apr 2002 16:27:36 -0800
> From: "Donn Terry" <yyyyyy@xxxxxxxxxxxxx>
> 
> The critical piece of information that's missing here is that for
> the most popular implementations (Spencer, "original UNIX", Linux,
> come to mind but might not be the only choices) what actually happens
> today.  (And historically, if things have changed.)

That would take a lot of work, and though I'd be amused to see the
results if someone does it, I'm not sure it would be an efficient use
of anybody's time.

RE-CONCAT, RE-ASSOC, and RE-ITERATE ask about dark corners of RE
semantics that most applications rarely visit.  Like Tom Lord, I
suspect that most existing implementations return somewhat random
results in this area, results that are not consistent consequences of
conscious implementation decisions, but rather are unforeseen
consequences of code that has never been thought through completely.

Let me give an example of this problem, an example that is independent
of RE-CONCAT, RE-ASSOC, and RE-ITERATE:

  Suppose we match the BRE /\(a\{2,3\}\)\1*/ against the string
  "aaaa".  POSIX requires the longest match, which is "aaaa".  And yet
  Solaris 8 says the longest match is "aaa".  And OpenBSD 3.0 says the
  longest match is "aa".  Glibc 2.2 gets it right.

  Are these incorrect answers the result of conscious implementation
  tradeoffs?  No.  They're bugs.

  Do users depend on the incorrect answers?  Not likely.

  Would it be useful for users if POSIX were modified to allow all
  three answers?  Nope.

I'd guess there are dozens of other such examples that one could come
up with, if one had the time, and they would illustrate that the RE
code in most existing implementations is full of bugs in areas that
don't get exercised much.

So, to answer RE-CONCAT, RE-ASSOC, and RE-ITERATE, I would not worry
overly much about the accidents of current implementations.  It's more
important to understand the intent of the standard.  After all, if the
Glibc maintainers had merely wanted to imitate what other
implementations do, they would have just run some examples on them;
they wouldn't have bothered to ask austin-group-l about these issues.


> Given the "existing practice" decisions that were made during the last
> revision (to the extent of backing out text that was in .2-1992 that
> was intentionally and with full knowledge not compatible with existing
> practice when it was written)

That was quite a different situation.  In that case:

 (A) POSIX 1003.2-1992 invented new RE syntax (e.g., /[^[:alpha:]]/).
 (B) The semantics extended widespread existing practice.
 (C) The semantics were never correctly supported by any implementation.
 (D) The semantics were impossible to implement in user code.

Here, we're talking about a case where:

 (A) POSIX 1003.2-1992 was interpreting longstanding RE syntax.
 (B) The semantics contradicted widespread existing practice.
 (C) The semantics are supported by some implementations.
 (D) The semantics can be implemented in user code.

<Prev in Thread] Current Thread [Next in Thread>