Email List: Xaustin-group-lX
[All Lists]

Re: RE-ASSOC: a question about the associativity of RE concatenation

To: yyyyyy@xxxxxxxxxxx
Subject: Re: RE-ASSOC: a question about the associativity of RE concatenation
From: Tom Lord <yyyy@xxxxxxx>
Date: Tue, 30 Apr 2002 12:59:40 -0700 (PDT)
Cc: yyy@xxxxxxxxxxxxxxxx, yyyyyyyyyyyyyy@xxxxxxxxxxxxx, yyyyy@xxxxxxxxxxxxxx, yyyyy@xxxxxxxxxx
References: <200204091857.OAA35259@raptor.research.att.com> <200204092229.g39MTeX00802@shade.twinsun.com> <200204121008.DAA20963@morrowfield.home> <200204290716.DAA01769@raptor.research.att.com> <200204291846.g3TIkB109921@shade.twinsun.com> <200204300313.XAA78654@raptor.research.att.com> <200204300627.g3U6RV809501@sic.twinsun.com>

        eggert:

        But I must start off thinking 'parse tree', since that's how
        the POSIX standard defines ERE syntax.

I agree.  If necessary, we can argue that at length.  However, gsf is
only trying to show that the standard is ambiguous and _might_ lead to
right-associative semantics for concatenation.

Let me play devils advocate, for the moment, and argue his position in
the way that convinced me not to spend a lot of time arguing against
the possibility that the spec is ambiguous, if I can avoid it.  (This
is my penance for orginally labeling his reading "absurd". :-)

`subpattern', in the matching rule, is ordinary english.  It just
means any kind of constituent part of a larger pattern.  So, what is a
"constituent part"?  It might be, as lord and eggert maintain, a part
generated by any non-terminal in the grammar.  _Or_, it might refer to
the construction of complex regexps given in the text.

In the text, EREs and BREs matching a single character are defined.
Then rules are given to build up more complex patterns from those.
For example, you can add `+' or `?' after an ERE matching a single
character or an ERE enclosed in parentheses.  Another example: you can
(subject to the operator precedence rules) form a new ERE by
concatenating several other EREs.

So, in terms of the construction of the text, here is a definition for
`nth_subpattern':


        function nth_subpattern (pattern, n)
        {
          if pattern is a parenthesized subexpression
                return the pattern that is enclosed in parens

          if the lowest precedence operator in pattern is *,+,{}, or ?
                return the pattern with that operator removed

          if the lowest precedence operator in pattern is |
                treat | as a variable arity operator
                return the nth branch

          if the lowest precedence operator in pattern is concatenation
                treat concatenation as a variable arity operator
                return the nth concatenated element
        }


Some examples:


        nth_subpattern(/ab*c/, 0)       => /a/
        nth_subpattern(/ab*c/, 1)       => /b*/
        nth_subpattern(/ab*c/, 2)       => /c/


        nth_subpattern(/foo|foobar|baz/, 0)     => /foo/
        nth_subpattern(/foo|foobar|baz/, 1)     => /foobar/
        nth_subpattern(/foo|foobar|baz/, 2)     => /baz/


        nth_subpattern(/b*/, 0)                 => b

        nth_subpattern(/[abc]/, 0)              => <undefined>


Lord and eggert use an alternative definition for `nth_subpattern',
based on the grammar:

        function nth_subpattern (pattern, n)
        {
          if pattern is a parenthesized subexpression
                return the pattern that is enclosed in parens

          if the lowest precedence operator in pattern is *,+,{}, or ?
                return the pattern with that operator removed

          if the lowest precedence operator in pattern is |
                treat | as a left-associative binary operator
                return the nth (0 or 1) branch

          if the lowest precedence operator in pattern is concatenation
                treat concatenation as a left-associative binary operator
                return the nth (0 or 1) concatenated element
        }

With examples:

        nth_subpattern(/ab*c/, 0)       => /ab*/
        nth_subpattern(/ab*c/, 1)       => /c/

        nth_subpattern(/foo|foobar|baz/, 0)     => /foo|foobar/
        nth_subpattern(/foo|foobar|baz/, 1)     => /baz/


        nth_subpattern(/b*/, 0)                 => b

        nth_subpattern(/[abc]/, 0)              => <undefined>



Now, taking off my devil's advocate costume -- I think there are good 
arguments that the grammar-based definition of `nth_subpattern' is
the only right reading of the spec.  Unfortunately, the arguments are
based on fairly subtle wordplay.

So, what happens if we stipulate that the spec is ambiguous about
whether or not `|' and concatenation are binary operators or variable
arity?  Then, as a consequence, we've stipulated that the spec is 
ambiguous about whether or not concatenation and `|' have left or
right associative semantics.

When interpreting an existing standard, I think we have a set of
constraints to satisfy.  These are:


                   ================================
                   higest priority constraints 
                   (but unordered among themselves)


                        A. What the standard says.

                        B. What historic implementations do,
                           perhaps giving some weight to the
                           conforming vendor implementations 
                           and BSD.

                        C. What the standard authors clearly
                           meant.

                        D. What the historic implementors 
                           clearly meant.

                   ================================
                   low priority constraint
                   (use only to resolve ties)

                        E. What makes the most sense.

                   ================================

Let's examine those constraints:

A. What the standard says.

   We can argue forever about what the standard says, or
   stipulate that it is ambigous.  

   For those of us who believe left-associativity is correct,
   let's be generous and stipulate for the moment that the
   spec is ambiguous.


B. What historic implementors do.

   Libc `regexec' implementations are full of all kinds of bugs.  They
   aren't reliable guides at all.

   Many open source programs, for example, avoid using `regexec'
   at all -- presumably because implementations vary so wildly -- 
   and instead include their own regexp implementations.

   It's significant that the conformance test suites are very weak
   regarding regexps.  

   In short, it doesn't look to me like there is much if any
   application code in the world that depends on the associativity
   question going one way or the other.
   


C. What the standard authors clearly meant.

   The authors are silent on this issue.  Their intent
   does not clearly extend to this issue.  It is quite 
   plausible that they did not realize associativity of
   concatenation matters.  In that case, they can not be 
   said to have had any intent on this issue at all.

   It's significant that you can assume various definitions
   of "subpattern" that _implicitly_ decide the associativity
   of concatenation.  Then you can give another definition
   which is equivalent in every respect except that it 
   _explicitly_ decides associativity in the opposite way.
   It's significant that _none_ of the example regexps in the
   spec are sensative to the associativity issue.

   So if you stated or assumed a definition of "subpattern",
   but didn't _explicitly_ consider associativity, then
   we can't look to your definition to answer any questions about
   associativity: your intent might have decided associativity
   simply by accident.

D. What the historic implementors clearly meant.

   The implementors are largely silent on this issue. 

   (Am I the sole exception?  I made Rx left associative a few years
   ago as a deliberate choice, after studying the standard.)

   When giving weight to what a historic implementation does
   wrt. to associativity, the comments above for constraint
   (C) apply here as well:  any decision an implementor made
   regarding associativity might very well have been an accidental
   one.

E. What makes the most sense.

   We can argue forever about constraints A..D, or we can pick the
   answer that makes the most sense.  Constraint E is there for
   just such cases where A..D can't resolve the issue.

   For the associativity question, I know of no answer that makes more
   sense than to give regexps the useful algebraic properties that
   follow from the left associativity of concatenation.  More on
   this in the next message.

<Prev in Thread] Current Thread [Next in Thread>