Email List: Xaustin-regexp-lX

Re: Re: starting point

To: yyyyyyyyyyyyyyy@xxxxxxxxxxxxx
Subject: Re: Re: starting point
From: David Korn <yyy@xxxxxxxxxxxxxxxx>
Date: Tue, 28 May 2002 10:41:38 -0400 (EDT)
--------

> On Thu, 23 May 2002 11:02:16 -0400 (EDT) David Korn wrote:
> >           o Interp #43, Part 14.  What does it mean by the left-
> >             to-right order in a match on line 5908?  For example,
> >             with pattern
> >                     ((..)*(.....)*)*
> >             and string xxxxx, what should \1 be?  Lines 5911 and
> >             5908 would give contradictory answers.
> > ACTION:
> >             To resolve Interp #43, Part 14, add on XBD page 167, Section
> >             9.1 after sentence ending on line 5908,
> >             "An enclosed subpattern is deemed to be to the right of
> >             an enclosing pattern."
> 
> could you post the original lines 5908-5911
> in the copy I have 5911 deals with multi-character collating elements
> 
> -- Glenn Fowler <yyy@xxxxxxxxxxxxxxxx> AT&T Labs Research, Florham Park NJ --
> 
> 

Here are the lines 5907-5911

        Consistent with the whole match being the longest of the leftmost
        matches, each subpattern, from left to right, shall match the longest
        possible string.  For this purpose, a null string shall be
        considered to be longer than no match at all.  For example, the
        BRE "\(.*\).*" matched against "abcdef", the subexpression "(\1)"
        is "abcdef", and matching the BRE "\(a*\)*" against "bc", the
        subexpression "(\1)" is the null string.
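As a rough illustration only, the two examples in that paragraph can be
tried in Python's re module.  This is a sketch under assumptions: re is a
backtracking, leftmost-first engine rather than a POSIX leftmost-longest
matcher, and it uses ERE-style parentheses, so the BREs are transcribed
as (.*).* and (a*)*.  It happens to report the same \1 for these two
inputs, which says nothing about how it behaves on other patterns.

```python
import re

# BRE \(.*\).* transcribed to Python syntax, matched against "abcdef".
m1 = re.match(r"(.*).*", "abcdef")
first = m1.group(1)

# BRE \(a*\)* transcribed to Python syntax, matched against "bc".
# The group participates once and matches the null string at offset 0.
m2 = re.match(r"(a*)*", "bc")
second = m2.group(1)

print(repr(first), repr(second))
```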

I am also enclosing the minutes of the RE experts meeting.  Note
that all the line numbers there refer to the 1992 standard.

=====================cut here===========================

                    Minutes of the RE experts meeting
                             Toronto, Canada
                              June 29, 1995

       In attendance:
            Mark Funkenhauser
            David Korn
            Doug McIlroy
            Rodney Ruddock
            Henry Spencer



       The purpose of the meeting was to resolve RE issues for the
       POSIX 1003.2b standard.  The agenda was to go over issues
       related to interpretations and to resolve other issues that
       have been identified during implementation.  I18N issues
       were handled last.

          o Interp #43, Part 15.  To which match is a backreference
            to a duplicated subexpression bound?  To resolve Interp
            #43, Part 15, add on page 17, Section 2.8.3.3 of
            P1003.2b/D11 after line 400, "When a referenced
            subexpression does not match any string (not even the
            empty string), the backreference expression fails to
            match.  When subexpressions are nested, the substrings
            matching them are similarly nested.  When a contained
            subexpression fails to participate in the last match of
            its containing subexpression, backreferences to the
            contained subexpression fail to match."

            For example
                    \(a*\(b\)*\)\{2\}\2
            fails to match
                    ba
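
            A minimal sketch of the first and last sentences of that
            wording, using Python's re module as a stand-in.  This is
            an assumption for illustration: re is not a POSIX matcher
            and uses ERE-style syntax, so the BRE above is transcribed
            as (a*(b)*){2}\2.  It agrees with the stated result on
            these inputs, but it is not evidence that re implements
            the nesting rule in general.

```python
import re

# A backreference to a group that matched nothing fails; it does not
# silently match the empty string.  (a)? is skipped on "b".
unset_ref = re.search(r"(a)?\1", "b")

# The example from the minutes, transcribed from BRE to Python syntax.
nested_ref = re.search(r"(a*(b)*){2}\2", "ba")

print(unset_ref, nested_ref)
```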

          o Interp #43, Part 12.  Can a duplicated subexpression
            match the null string?  If so, will the duplication be
            repeated until the expression does match the null
            string?  There was a consensus that applying a
            duplicator to a RE that could match the empty string
            should be unspecified.  However, if specified, the
            specification in P1003.2b/D11 is incorrect and should
            be changed.  On page 17, Section 2.8.3.3 of
            P1003.2b/D11, change lines 407-408 to, "When a
            subexpression or a backreference is repeated by an
            asterisk(*) or an interval expression, the
            subexpression shall not match a null expression unless
            it is".  Also, change lines 413-416 of the same section
            to, "When an ERE enclosed in parentheses is repeated by
            a *, ?, +, or an interval expression, the ERE enclosed
            in parentheses shall not match the empty string unless
            it is necessary to satisfy the exact or minimum number
            of occurrences for the + or interval expression."

          o Interp #43, Part 14.  What does it mean by the left-
            to-right order in a match on line 2792?  For example,
            with pattern
                    ((..)*(.....)*)*
            and string xxxxx, what should \1 be?  Lines 2976 and
            2792 would give contradictory answers.

            To resolve Interp #43, Part 14, add on page 77, Section
            2.8.2 of P1003.2 after sentence ending on line 2792,
            "An enclosed subpattern is deemed to be to the right of
            an enclosing pattern."  On page 82, Section 2.8.3.3 of
            P1003.2, change line 2971 to "The following rules, in
            conjunction with the general requirements of 2.8.2,
            shall be used to construct BREs matching multiple
            characters".  On page 82, Section 2.8.3.3 of P1003.2,
            line 2976 replace "whatever" with "any string".  On
            page 85, Section 2.8.4.3 of P1003.2, change line 3090
            to "The following rules, in conjunction with the
            general requirements of 2.8.2, shall be used to
            construct EREs matching multiple characters".  On page
            85, Section 2.8.4.3 of P1003.2, line 3094 replace
            "whatever" with "any string".

          o Interp #44.  There was unanimous agreement that the
            error numbers must be unique.

          o Interp #45.  Current interp and change OK.

          o Interp #60.  This needs to be fixed in .1.

          o Interp #73.  Current interp and change OK.

          o Interp #82.  The current interpretation is incorrect.
            In section 2.8.3.3, lines 2980-2981 the standard says
            that a backreference matches a "string of characters".
            Therefore, the standard requires that the expression
            \(^b\)\1 must match the first two characters of bbbb.
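
            The reading above treats the backreference as matching
            the string "b", not as re-applying the anchor.  A small
            sketch in Python's re module (an assumed stand-in, not a
            POSIX implementation; the BRE is transcribed as (^b)\1)
            shows the same outcome:

```python
import re

# (^b) captures "b" at the start of the string; \1 then matches the
# captured characters, so the anchor is not applied a second time.
m = re.match(r"(^b)\1", "bbbb")

print(m.group(0), m.span())
```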

          o Interp #85.  Agreement except that dumping core should
            not be allowed for bad expressions.  Therefore on lines
            2833, 3055, 3065, 3070, and 3077, undefined should be
            changed to unspecified.

          o Interp #86.  Current interp and change OK.

          o Interp #88.  Wording for interp #43, part 15 should
            take care of this.

          o Interp #125.  What is the meaning of BRE\{0,0\}?  The
            current wording leaves the behavior unspecified.  On
            page 82, Section 2.8.3.3 of P1003.2, add after end of
            sentence on line 2992, "Zero occurrences of a BRE match
            the empty string".  On page 85, Section 2.8.4.3 of
            P1003.2, add after end of sentence on line 3107, "Zero
            occurrences of an ERE match the empty string".  Note
            that this added sentence must also apply to parts (3),
            (4) and (5).
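
            To make "zero occurrences match the empty string"
            concrete, here is a hedged sketch using Python's re
            module, which is assumed here only as a convenient
            engine that accepts a {0,0} interval (written in ERE
            style).  The pattern and inputs are made up for
            illustration.

```python
import re

# "ab" repeated exactly zero times contributes the empty string, so
# the whole pattern matches "c" alone and the group never participates.
with_tail = re.fullmatch(r"(ab){0,0}c", "c")

# On its own, the zero-repetition pattern matches only the empty string.
alone = re.fullmatch(r"(ab){0,0}", "")

# One actual occurrence of "ab" is too many.
too_many = re.fullmatch(r"(ab){0,0}c", "abc")

print(with_tail, alone, too_many)
```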

          o Doug #6.  Does case folding apply to backreferences?
            On page 82, Section 2.8.3.3 of P1003.2, add after line
            2988, "When pattern matching is being performed without
            regard to case, the backreference match will occur
            without regard to case."

            Also, on page 80, Section 2.8.3.2 of P1003.2, add after
            end of sentence on line 2891, "Whenever pattern
            matching is being performed without regard to case,
            each character or collating element shall be deemed to
            stand for itself and all its case counterparts."  On
            page 78, Section 2.8.2 of P1003.2, line 2817 change
            "counterpart" to "counterparts".  It wasn't clear
            whether there is such a thing as upper and lower
            multi-character collating elements.
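
            The intended effect on backreferences can be sketched
            with Python's re module.  This is only an illustration
            under the assumption that re.IGNORECASE plays the role of
            matching "without regard to case"; it is not a POSIX
            regcomp() with REG_ICASE, and other engines may differ.

```python
import re

# With case folding, \1 matches "A" even though the group captured "a".
folded = re.fullmatch(r"(a)\1", "aA", re.IGNORECASE)

# Without case folding, the backreference must repeat "a" exactly.
strict = re.fullmatch(r"(a)\1", "aA")

print(folded, strict)
```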

          o I18N.  A lot of confusion about the use of character or
            collating element.  Doesn't collating element include
            character?  Usage is inconsistent.

          o Interp #41, Part 7.  On page 80, Section 2.8.3.2 of
            P1003.2, change line 2920 to "Shall represent the set
            containing only that collating element".

          o Henry #1.  The RE a)b should be unspecified.  An interp
            will be submitted by Henry Spencer.  To resolve, on
            page 84, Section 2.8.4.1.2 of P1003.2, line 3062 add ).
            Delete lines 3066-3067.  Also, on page 88, section
            2.8.5.1, delete lines 3221-3222.

          o Interp #27.  The group believes that the resolution
            makes no sense.  Since there is no interface to get the
            collating sequence order, it is impossible to tell what
            the correct answer is.  Given the sentence on page 81,
            lines 2936-2937, that "Range expressions shall not be
            used in Strictly Conforming POSIX.2 Applications
            because their behavior is dependent on the collating
            sequence", it appears that the primary reason for range
            expressions is for backwards compatibility.  However,
            the use of the character order sequence makes backwards
            compatibility less likely since this order is
            arbitrary.  The unanimous recommendation of the RE
            group is to restrict range endpoints to characters, and
            to match all the collating elements between these
            endpoints.  For example, given the range [a-z], any
            collating element c, such that strcoll("a","c")<=0 and
            strcoll("z","c")>=0 would be in this range.  An
            alternative would be to use the character ordinal
            value.
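
            A minimal sketch of the recommended membership test,
            relying on the usual strcoll() convention that a negative
            result means the first argument collates first.  The
            helper name in_range is hypothetical, and the "C" locale
            is chosen only so the result is predictable; in other
            locales the answers depend on the collating sequence.

```python
import locale


def in_range(low, high, element):
    """Return True if element collates between low and high, inclusive."""
    return (locale.strcoll(low, element) <= 0
            and locale.strcoll(element, high) <= 0)


# Assumption: use the "C" locale so collation follows character codes.
locale.setlocale(locale.LC_COLLATE, "C")

print(in_range("a", "z", "m"))   # single character inside [a-z]
print(in_range("a", "z", "ch"))  # multi-character element between a and z
print(in_range("a", "z", "A"))   # upper case sorts before "a" in "C"
```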

          o Interp #29.  Lines 188-191 of interp wrong if collating
            elements.

          o Interp #40.  Having bracket expressions matching
            collating elements causes problems since ([[:alpha:]])*
            won't match the collating element [.ch.].  Some
            discussion of saying that bracket expressions that
            use [.x.] would match collating elements and
            those that do not would match characters.  Handling
            equivalence classes would be tricky.  Could have [=x=]
            imply collating elements, or use [[=x=][.x.]] to match
            collating elements, or add [.=x=.] to match collating
            elements.

            On page 52, Section 2.5.2.2 of P1003.2, line 1668 it
            says that the two strings are first broken up into
            collating elements, but doesn't explain how this
            happens.  Add somewhere in 2.5.2.2, "The first collating
            element of a string is the largest possible collating
            element."
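
            Applied repeatedly, the proposed sentence amounts to a
            greedy, longest-first split.  The sketch below assumes
            the locale's multi-character collating elements are
            available as a plain set of strings; that set and the
            function name are hypothetical, since, as noted under
            Interp #41, there is no standard interface to obtain them.

```python
def split_collating_elements(text, multi_char_elements):
    """Split text greedily, taking the longest known element each time."""
    # Try longer candidates first so that "ch" wins over plain "c".
    candidates = sorted(
        (e for e in multi_char_elements if len(e) > 1),
        key=len,
        reverse=True,
    )
    elements = []
    pos = 0
    while pos < len(text):
        for cand in candidates:
            if text.startswith(cand, pos):
                elements.append(cand)
                pos += len(cand)
                break
        else:
            # No multi-character element starts here, so the single
            # character is taken as a collating element by itself.
            elements.append(text[pos])
            pos += 1
    return elements


print(split_collating_elements("chico", {"ch"}))
```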

          o Interp #41.  Dot1 needs to add interfaces for getting
            next collating element from a string, getting
            equivalence classes, and getting names of collating
            elements.

          o Interp #41.  Part 8 is answered on page 729, section
            B.5.2 lines 367-368.

          o Interp #41.  On page 81, Section 2.8.3.2 of P1003.2,
            lines 2961-2963, change characters to collating
            elements.


       David Korn

=====================cut here===========================
        

David Korn
research!dgk
yyy@xxxxxxxxxxxxxxxx
