Email List: Xaustin-group-lX
[All Lists]

Re: RE-ASSOC: a question about the associativity of RE concatenation

To: yyyyyyyyyyyyyy@xxxxxxxxxxxxx
Subject: Re: RE-ASSOC: a question about the associativity of RE concatenation
From: David Korn <yyy@xxxxxxxxxxxxxxxx>
Date: Tue, 9 Apr 2002 14:57:22 -0400 (EDT)
Cc: yyyyyy@xxxxxxxxxxx, yyy@xxxxxxxxxxxxxxxx, yyyyy@xxxxxxxxxx
> RE-ASSOC: a question about the associativity of RE concatenation
> 
> [Some of the people in the GNU projects (Paul Eggert, Tom Lord,
> Isamu Hasegawa, myself) were considering consolidating the various
> regex implementations in GNU, and came up with a series of questions
> concerning the Specification. This is one of those questions.]
> 
> Must pattern matching strictly follow the left-associative POSIX
> grammar for regular expressions, or is pattern matching allowed to
> behave as if the grammar had been right-associative?
> 
> This question follows up the earlier question RE-CONCAT; please see
> the discussion of the earlier question for much of the background
> behind this question.
> 
> The POSIX grammar for EREs contains this production:
> 
>     ERE_branch :            ERE_expression
>                | ERE_branch ERE_expression
> 
>     
> 
><http://www.opengroup.org/onlinepubs/007904975/basedefs/xbd_chap09.html#tag_09_05_03>
> 
> This grammar is left-associative, which means that the ERE /ABC/ is
> parsed as if it were /(AB)C/, aside from parenthesis numbering.
> 
> In the REs of classical formal language theory developed by Stephen
> Kleene and others, the concatenation of REs is associative, in the
> sense that the REs /(AB)C/ and /A(BC)/ describe the same regular sets.
> But POSIX REs might have different possible interpretations depending
> on how one associates subpatterns to matched strings.
> 
> For example, suppose we match the ERE /(week|wee)(night|knights)(s*)/
> to the string "weeknights", and want to know which subpatterns match
> which substrings.  Here are some possible interpretations:
> 
>   (1a) The "subpatterns" are the two immediate subexpressions,
>       interpreted according to the grammar, namely
>       /(week|wee)(night|knights)/ and /(s*).  The longest consistent
>       match for /(week|wee)(night|knights)/ is "weeknights".
>       Therefore, /(s*)/ matches the empty string.
> 
>   (1b) Even though the grammar is left-associative, the matching
>       semantics are right-associative.  The "subpatterns" are the two
>       subexpressions /(week|wee)/ and /(night|knights)(s*)/.  The
>       longest consistent match for /(week|wee)/ is "week".  We then
>       apply the rule recursively, and find that the longest consistent
>       match for /(night|knights)/ against "nights" is "night".
>       Therefore, /(s*)/ match "s".
> 
>   (2) The "subpatterns" are all the subexpressions, not just the
>       concatenated subexpressions (see interpretation (2) of question
>       RE-CONCAT), so the matching semantics are inherently
>       right-associative.  This returns the same match as (1b), but for
>       the slightly different ERE /(wee|week)(night|knights)(s*)/, (1b)
>       still says /(s*)/ matches "s" but (2) says /(s*)/ matches the
>       empty string.
> 
>   (3) The standard is ambiguous, and some or all the above
>       interpretations are valid.  Perhaps other interpretations are
>       valid as well.  A conforming implementation might even use
>       different interpretations at different times.
> 
> (1a) and (1b) are both consistent with interpretation (1) of question
> RE-CONCAT.  Note that a user can override either (1a) or (1b) by using
> explicit parenthesization.  (2) is consistent with RE-CONCAT
> interpretations (2); under (2), explicit parenthesization is
> irrelevant.
> 
> Here are some arguments for and against these interpretations.  These
> lengthy arguments are briefly summarized at the end of this section.
> 
>    pro (1a):
> 
>      The most natural interpretation of the standard and rationale is
>      that the implementation must recurse through the regular
>      expression's parse tree, as indicated by the grammar.
> 
>      The ERE must be parsed strictly according to the grammar, because
>      the standard says that the ERE "grammar takes precedence 
> over the text".
>      
> 
><http://www.opengroup.org/onlinepubs/007904975/basedefs/xbd_chap09.html#tag_09_05>
> 
>    con (1a):
> 
>      (1a) is counterintuitive, because it says that when matching
>      /XYZ/, it is more important to maximize the length of the match
>      for /XY/ than it is to maximize the length of the match for /X/.
>      This contradicts the left-to-right nature of the subpattern rule.
> 
>      Had the POSIX authors intended (1a), they would have written the
>      grammar to be right-associative, to avoid this counterintuitive
>      behavior of (1a).
> 
>    pro (1a):
> 
>      Had the POSIX authors intended that it be more important to
>      maximize the length of /X/ than to maximize the length of /XY/
>      when matching /XYZ/, they could easily done so by using a
>      right-associative grammar for ERE_branch.  However, they used a
>      left-associative grammar, indicating that (1a) was intended over
>      (1b).
> 
>      This is reminiscent of the situation in the C programming
>      language.  Classical addition is associative, but Standard C
>      requires that X+Y+Z must be evaluated according to C's
>      left-associative grammar, and must therefore be evaluated as
>      (X+Y)+Z and not as X+(Y+Z).  The distinction matters, for example,
>      if floating-point arithmetic is used, due to rounding errors.
>      Similarly, classical RE concatenation is associative, but POSIX RE
>      concatenation is not, due to the side effects of subpattern
>      matching, so it is important to follow the associativity rules
>      specified by the grammar.
> 
>    con (1a):
> 
>      The C Standard explicitly says that X+Y+Z must be evaluated as
>      (X+Y)+Z, whereas the POSIX Standard does not explicitly say how to
>      follow the RE grammar when determining matched subpatterns.
>      Admittedly the X+Y+Z statement in the C Standard is a
>      non-normative footnote, but the POSIX authors could also have
>      added a footnote if they had wanted to make it clear that (1a) was
>      intended.
> 
>    pro (1b) and (2):
> 
>      The semantics should not be contorted by the particular method used
>      to define the syntax.  The standard says:
> 
>        Portions of this volume of IEEE Std 1003.1-2001 are 
> expressed in terms
>        of a special grammar notation. It is used to portray the complex
>        syntax of certain program input. The grammar is based on 
> the syntax
>        used by the yacc utility. However, it does not represent fully
>        functional yacc input, suitable for program use; the lexical
>        processing and all semantic requirements are described 
> only in textual
>        form. The grammar is not based on source used in any traditional
>        implementation and has not been tested with the semantic code that
>        would normally be required to accompany it.
>        
> 
><http://www.opengroup.org/onlinepubs/007904975/utilities/xcu_chap01.html#tag_01_10>
> 
>      and this indicates that the grammar should not be taken overly
>      seriously when interpreting the semantics.
> 
>    pro (1a):
> 
>      Some POSIX matchers use (1a).  These include the Rx matcher and the
>      Spencer-derived matchers used by Tcl and others.
> 
>    See also the arguments pro and con RE-CONCAT's interpretations
>    (1), (2), and (3).
> 
> The bottom line of this discussion:
> 
>    This question is probably moot unless RE-CONCAT interpretation (1) is
>    chosen, in which case the realistic choices are (1a) or (1b).
> 
>    (1a) follows the POSIX standard more strictly, but (1b) typically
>    matches user intent better.  The differences between (1a) and (1b)
>    rarely arise in practice, so it wouldn't affect applications much if
>    either interpretation were chosen.  A user can override either (1a)
>    or (1b) by using explicit parenthesization.
> 
> Suggestion:
> 
>    Modify the POSIX grammar so that RE concatenation is
>    right-associative.  This would cause (1a) to have the semantics of
>    (1b), thus unifying the two interpretations and removing the
>    confusion.
> 
> 
> 
> ---
> Mark S. Brown
> Senior Technical Staff Member, IBM Server Group
> 512.838.3926 fax 512.838.3882
> yyyyy@xxxxxxxxxx              
> 
> 

I believe that this issue was resolved in the June 1995 POSIX RE experts
meeting in Toronto which I chaired.  I have enclose the minutes
below. 

In the case of
        /(week|wee)(night|knights)(s*)/
Once the longest leftmost match of the complete string is found, "weeknights",
which could be matched by or
        wee knights
        week night s
The longest leftmost match of the leftmost parenthesised group is
matched.  This, the result is
        week night s

Notice that nested subexpressions are to be considered to the right
of the expression that it nests.

===============cut here======================



                                  - 1 -



                    Minutes of the RE experts meeting
                             Toronto, Canada
                              June 29, 1995

       In attendance:
            Mark Funkenhauser
            David Korn
            Doug McIlroy
            Rodney Ruddock
            Henry Spencer



       The purpose of the meeting was to resolve RE issues for the
       POSIX 1003.2b standard.  The agenda was to go over issues
       related to interpretations and to resolve other issues that
       have been identified during implementation.  I18N issues
       where handled last.

          o Interp #43, Part 15.  To which match is a backreference
            to a duplicated subexpression bound?  To resolve Interp
            #43, Part 15, add on page 17, Section 2.8.3.3 of
            P1003.2b/D11 after line 400, "When a referenced
            subexpression does not match any string (not even the
            empty string), the backreference expression fails to
            match.  When subexpressions are nested, the substrings
            matching them are similarly nested.  When a contained
            subexpression fails to participate in the last match of
            its containing subexpression, backreferences to the
            contained subexpression fail to match."

            For example
                    \(a*\(b\)*\)\{2\}\2
            fails to match
                    ba

          o Interp #43, Part 12.  Can a duplicated subexpression
            match the null string?  If so, will the duplication be
            repeated until the expression does match the null
            string?  There was a consensus that applying a
            duplicator to a RE that could match the empty string
            should be unspecified.  However, if specified, the
            specification in P1003.2b/D11 is incorrect and should
            be changed.  On page 17, Section 2.8.3.3 of
            P1003.2b/D11, change lines 407-408 to, "When a
            subexpression or a backreference is repeated by an
            asterisk(*) or an interval expression, the
            subexpression shall not match a null expression unless
            it is".  Also, change lines 413-416 of the same section
            to, "When an ERE enclosed in parentheses is repeated by
            a *, ?, +, or an interval expression, the ERE enclosed











                                  - 2 -



            in parentheses shall not match the empty string unless
            it is necessary to satisfy the exact or minimum number
            of occurrences for the + or interval expression."

          o Interp #43, Part 14.  What does it mean by the left-
            to-right order in a match on line 2792?  For example,
            with pattern
                    ((..)*(.....)*)*
            and sting xxxxx, what should \1 be?  Lines 2976 and
            2792 would give contradictory answers.

            To resolve Interp #43, Part 14, add on page 77, Section
            2.8.2 of P1003.2 after sentence ending on line 2792,
            "An enclosed subpattern is deemed to be to the right of
            an enclosing pattern."  On page 82, Section 2.8.3.3 of
            P1003.2 , change line 2971 to "The following rules, in
            conjunction with the general requirements of 2.8.2,
            shall be used to construct BREs matching multiple
            characters".  On page 82, Section 2.8.3.3 of P1003.2 ,
            line 2976 replace "whatever" with "any string".  On
            page 85, Section 2.8.4.3 of P1003.2 , change line 3090
            to "The following rules, in conjunction with the
            general requirements of 2.8.2, shall be used to
            construct EREs matching multiple characters".  On page
            85, Section 2.8.4.3 of P1003.2 , line 3094 replace
            "whatever" with "any string".

          o Interp #44.  There was unanimous agreement that the
            error numbers must be unique.

          o Interp #45.  Current interp and change OK.

          o Interp #60.  This needs to be fixed in .1.

          o Interp #73.  Current interp and change OK.

          o Interp #82.  The current interpretation is incorrect.
            In section 2.8.3.3, lines 2980-2981 the standard says
            that a backreference matches a "string of characters".
            Therefore, the standard requires that the expression
            \(^b\)\1 must match the first two characters of bbbb.

          o Interp #85.  Agreement except that dumping core should
            not be allowed for bad expressions.  Therefore on lines
            2833, 3055, 3065, 3070, and 3077, undefined should be
            changed to unspecified.

          o Interp #86.  Current interp and change OK.

          o Interp #88.  Wording for interp #45, part 15 should
            take care of this.











                                  - 3 -



          o Interp #125.  What is the meaning of BRE\{0,0\}?  The
            current wording leaves the behavior unspecified.  On
            page 82, Section 2.8.3.3 of P1003.2, add after end of
            sentence on line 2992, "Zero occurrences of a BRE match
            the empty string".  On page 85, Section 2.8.4.3 of
            P1003.2, add after end of sentence on line 3107, "Zero
            occurrences of an ERE match the empty string".  Note
            that this added sentence must also apply to parts (3),
            (4) and (5).

          o Doug #6.  Does case folding apply to backreferences?
            On page 82, Section 2.8.3.3 of P1003.2, add after line
            2988, "When pattern matching is being performed without
            regard to case, the backreference match will occur
            without regard to case."

            Also, On page 80, Section 2.8.3.2 of P1003.2, add after
            end of sentence on line 2891, "Whenever pattern
            matching is being performed without regard to case,
            each character or collating element shall be deemed to
            stand for itself and all its case counterparts.  On
            page 78, Section 2.8.2 of P1003.2, line 2817 change
            "counterpart" to "counterparts".  It wasn't clear
            whether there is such a thing as upper and lower
            multi-character collating elements.

          o I18N.  lot of confusion about the use of character or
            collating element.  Doesn't collating element include
            character?  Usage is inconsistent.

          o Interp #41, Part 7.  On page 80, Section 2.8.3.2 of
            P1003.2, change line 2920 to "Shall represent the set
            containing only that collating element".

          o Henry #1.  The RE a)b should be unspecified.  An interp
            will be submitted by Henry Spencer.  To resolve, on
            page 84, Section 2.8.4.1.2 of P1003.2, line 3062 add ).
            Delete lines 3066-3067.  Also, on page 88, section
            2.8.5.1, delete lines 3221-3222.

          o Interp #27.  The group believes that the resolution
            makes no sense.  Since there is no interface to get the
            collating sequence order, it is impossible to tell what
            the correct answer is.  Given the sentence on page 81,
            lines 2936-2937, that "Range expressions shall not be
            used in Strictly Conforming POSIX.2 Applications
            because their behavior is dependent on the collating
            sequence", it appears that the primary reason for range
            expressions is for backwards compatibility.  However,
            the use of the character order sequence makes backwards
            compatibility less likely since this order is











                                  - 4 -



            arbitrary.  The unanimous recommendation of the RE
            group is to restrict range endpoints to characters, and
            to match all the collating elements between these
            endpoints.  For example, given the range [a-z], any
            collating element c, such that strcoll("a","c")>=0 and
            strcoll("z","c")<=0 would be in this range.  An
            alternative would be to use the character ordinal
            value.

          o Interp #29.  Lines 188-191 of interp wrong if collating
            elements.

          o Interp #40.  Having bracket expressions matching
            collating elements causes problems since ([[:alpha:]])*
            won't match the collating element .ch..  Some
            discussion of saying that bracket expressions that
            don't use [.x.] would match collating elements and
            those that do not would match characters.  Handling
            equivalence classes would be tricky.  Could have [=x=]
            imply collating elements, or use [[=x=][.x]] to match
            collating elements, or add [.=x=.] to match collating
            elements.

            On page 52, Section 2.5.2.2 of P1003.2, line 1668 it
            says that the two strings are first broken up into
            collating elements, but doesn't explain how this
            happens.  Add somewhere in 2.5.22, "The first collating
            element of a string is the largest possible collating
            element."

          o Interp #41.  Dot1 needs to add interfaces for getting
            next collating element from a string, getting
            equivalence classes, and getting names of collating
            elements.

          o Interp #41.  Part 8 is answered on page 729, section
            B.5.2 lines 367-368.

          o Interp #41.  On page 81, Section 2.8.3.2 of P1003.2,
            lines 2961-2963, change characters to collating
            elements.




       David Korn













===============cut here======================


David Korn
research!dgk
yyy@xxxxxxxxxxxxxxxx

<Prev in Thread] Current Thread [Next in Thread>