Subject: Re: Re: starting point
--------
> On Thu, 23 May 2002 11:02:16 -0400 (EDT) David Korn wrote:
> > o Interp #43, Part 14. What does it mean by the left-
> > to-right order in a match on line 5908? For example,
> > with pattern
> > ((..)*(.....)*)*
> > and string xxxxx, what should \1 be? Lines 5911 and
> > 5908 would give contradictory answers.
> > ACTION:
> > To resolve Interp #43, Part 14, add on XBD page 167, Section
> > 9.1 after sentence ending on line 5908,
> > "An enclosed subpattern is deemed to be to the right of
> > an enclosing pattern."
>
> could you post the original lines 5908-5911
> in the copy I have 5911 deals with multi-character collating elements
>
> -- Glenn Fowler <yyy@xxxxxxxxxxxxxxxx> AT&T Labs Research, Florham Park NJ --
>
>
Here are lines 5907-5911:
Consistent with the whole match being the longest of the leftmost
matches, each subpattern, from left to right, shall match the longest
possible string. For this purpose, a null string shall be
considered to be longer than no match at all. For example, the
BRE "\(.*\).*" matched against "abcdef", the subexpression "(\1)"
is "abcdef", and matching the BRE "\(a*\)*" against "bc", the
subexpression "(\1)" is the null string.
I am also enclosing the minutes of the RE experts meeting. Note
that all the line numbers here refer to the 1992 standard.
=====================cut here===========================
Minutes of the RE experts meeting
Toronto, Canada
June 29, 1995
In attendance:
Mark Funkenhauser
David Korn
Doug McIlroy
Rodney Ruddock
Henry Spencer
The purpose of the meeting was to resolve RE issues for the
POSIX 1003.2b standard. The agenda was to go over issues
related to interpretations and to resolve other issues that
have been identified during implementation. I18N issues
were handled last.
o Interp #43, Part 15. To which match is a backreference
to a duplicated subexpression bound? To resolve Interp
#43, Part 15, add on page 17, Section 2.8.3.3 of
P1003.2b/D11 after line 400, "When a referenced
subexpression does not match any string (not even the
empty string), the backreference expression fails to
match. When subexpressions are nested, the substrings
matching them are similarly nested. When a contained
subexpression fails to participate in the last match of
its containing subexpression, backreferences to the
contained subexpression fail to match."
For example
\(a*\(b\)*\)\{2\}\2
fails to match
ba
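A minimal sketch for checking this example, using Python's re
module purely as a stand-in. It is a backtracking matcher with
ERE-like syntax, not a POSIX implementation, and it keeps stale
captures from earlier iterations, so agreement on this case is
an observation and not a conformance claim.

```python
import re

# ERE-style spelling of the BRE \(a*\(b\)*\)\{2\}\2 from the minutes.
PATTERN = re.compile(r"(a*(b)*){2}\2")

# Every way of splitting "ba" across the two iterations leaves \2
# without a following "b" to match, so the search finds nothing.
result = PATTERN.search("ba")

# Simpler case: a backreference to a group that never matched fails.
unset = re.search(r"(b)*a\1", "a")

print(result, unset)
```

Both searches return None here.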
o Interp #43, Part 12. Can a duplicated subexpression
match the null string? If so, will the duplication be
repeated until the expression does match the null
string? There was a consensus that applying a
duplicator to a RE that could match the empty string
should be unspecified. However, if specified, the
specification in P1003.2b/D11 is incorrect and should
be changed. On page 17, Section 2.8.3.3 of
P1003.2b/D11, change lines 407-408 to, "When a
subexpression or a backreference is repeated by an
asterisk (*) or an interval expression, the
subexpression shall not match a null expression unless
it is". Also, change lines 413-416 of the same section
to, "When an ERE enclosed in parentheses is repeated by
a *, ?, +, or an interval expression, the ERE enclosed
in parentheses shall not match the empty string unless
it is necessary to satisfy the exact or minimum number
of occurrences for the + or interval expression."
o Interp #43, Part 14. What does it mean by the left-
to-right order in a match on line 2792? For example,
with pattern
((..)*(.....)*)*
and string xxxxx, what should \1 be? Lines 2976 and
2792 would give contradictory answers.
To resolve Interp #43, Part 14, add on page 77, Section
2.8.2 of P1003.2 after sentence ending on line 2792,
"An enclosed subpattern is deemed to be to the right of
an enclosing pattern." On page 82, Section 2.8.3.3 of
P1003.2 , change line 2971 to "The following rules, in
conjunction with the general requirements of 2.8.2,
shall be used to construct BREs matching multiple
characters". On page 82, Section 2.8.3.3 of P1003.2 ,
line 2976 replace "whatever" with "any string". On
page 85, Section 2.8.4.3 of P1003.2 , change line 3090
to "The following rules, in conjunction with the
general requirements of 2.8.2, shall be used to
construct EREs matching multiple characters". On page
85, Section 2.8.4.3 of P1003.2 , line 3094 replace
"whatever" with "any string".
o Interp #44. There was unanimous agreement that the
error numbers must be unique.
o Interp #45. Current interp and change OK.
o Interp #60. This needs to be fixed in .1.
o Interp #73. Current interp and change OK.
o Interp #82. The current interpretation is incorrect.
In section 2.8.3.3, lines 2980-2981 the standard says
that a backreference matches a "string of characters".
Therefore, the standard requires that the expression
\(^b\)\1 must match the first two characters of bbbb.
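As a rough illustration of the reading above, the same
expression can be tried in Python's re module. The ERE-style
spelling and the use of Python are assumptions made for the
sketch; Python is not a POSIX matcher, it merely treats a
backreference as a string of characters in the same way.

```python
import re

# ERE-style spelling of the BRE \(^b\)\1 from the minutes.
m = re.search(r"(^b)\1", "bbbb")

# \1 repeats the text "b"; it does not re-apply the ^ anchor.
matched = m.group(0)
span = m.span()

print(matched, span)
```

Here the match covers the first two characters, "bb".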
o Interp #85. Agreement except that dumping core should
not be allowed for bad expressions. Therefore on lines
2833, 3055, 3065, 3070, and 3077, undefined should be
changed to unspecified.
o Interp #86. Current interp and change OK.
o Interp #88. Wording for interp #43, part 15 should
take care of this.
o Interp #125. What is the meaning of BRE\{0,0\}? The
current wording leaves the behavior unspecified. On
page 82, Section 2.8.3.3 of P1003.2, add after end of
sentence on line 2992, "Zero occurrences of a BRE match
the empty string". On page 85, Section 2.8.4.3 of
P1003.2, add after end of sentence on line 3107, "Zero
occurrences of an ERE match the empty string". Note
that this added sentence must also apply to parts (3),
(4) and (5).
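A small sketch of the "zero occurrences match the empty string"
wording, using Python's re module as a stand-in (ERE-like
syntax, so \{0,0\} is written {0,0}). Python is not a POSIX
implementation; the sketch only shows one matcher that already
behaves this way.

```python
import re

# (abc){0,0} contributes nothing, so x...y matches "xy" only.
with_zero = re.fullmatch(r"x(abc){0,0}y", "xy")
with_one = re.fullmatch(r"x(abc){0,0}y", "xabcy")

# On its own, the expression matches the empty string.
empty = re.fullmatch(r"(abc){0,0}", "")

print(with_zero is not None, with_one is None, empty is not None)
```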
o Doug #6. Does case folding apply to backreferences?
On page 82, Section 2.8.3.3 of P1003.2, add after line
2988, "When pattern matching is being performed without
regard to case, the backreference match will occur
without regard to case."
Also, on page 80, Section 2.8.3.2 of P1003.2, add after
end of sentence on line 2891, "Whenever pattern
matching is being performed without regard to case,
each character or collating element shall be deemed to
stand for itself and all its case counterparts." On
page 78, Section 2.8.2 of P1003.2, line 2817 change
"counterpart" to "counterparts". It wasn't clear
whether there is such a thing as upper and lower
multi-character collating elements.
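The backreference rule can be sketched with Python's re module,
which is assumed here only because it has a case-insensitive
flag; it is not a POSIX matcher and says nothing about
multi-character collating elements.

```python
import re

# With case ignored, \1 may match "A" although group 1 captured "a".
folded = re.fullmatch(r"(a)\1", "aA", re.IGNORECASE)

# Without the flag, the backreference must repeat "a" exactly.
strict = re.fullmatch(r"(a)\1", "aA")

print(folded is not None, strict is None)
```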
o I18N. A lot of confusion about the use of character or
collating element. Doesn't collating element include
character? Usage is inconsistent.
o Interp #41, Part 7. On page 80, Section 2.8.3.2 of
P1003.2, change line 2920 to "Shall represent the set
containing only that collating element".
o Henry #1. The RE a)b should be unspecified. An interp
will be submitted by Henry Spencer. To resolve, on
page 84, Section 2.8.4.1.2 of P1003.2, line 3062 add ).
Delete lines 3066-3067. Also, on page 88, section
2.8.5.1, delete lines 3221-3222.
o Interp #27. The group believes that the resolution
makes no sense. Since there is no interface to get the
collating sequence order, it is impossible to tell what
the correct answer is. Given the sentence on page 81,
lines 2936-2937, that "Range expressions shall not be
used in Strictly Conforming POSIX.2 Applications
because their behavior is dependent on the collating
sequence", it appears that the primary reason for range
expressions is for backwards compatibility. However,
the use of the character order sequence makes backwards
compatibility less likely since this order is
arbitrary. The unanimous recommendation of the RE
group is to restrict range endpoints to characters, and
to match all the collating elements between these
endpoints. For example, given the range [a-z], any
collating element c, such that strcoll(c,"a")>=0 and
strcoll(c,"z")<=0 would be in this range. An
alternative would be to use the character ordinal
value.
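A minimal sketch of the recommended range test, written so that
an element c is accepted when it collates at or after the low
endpoint and at or before the high endpoint. It uses Python's
locale.strcoll, and the function name in_range is hypothetical.
The "C" locale is selected only to make the result reproducible.

```python
import locale

# Assumption: the "C" locale, so collation follows code point order.
locale.setlocale(locale.LC_COLLATE, "C")

def in_range(element, low="a", high="z"):
    """True if element collates between low and high, inclusive."""
    return (locale.strcoll(low, element) <= 0
            and locale.strcoll(element, high) <= 0)

print(in_range("m"), in_range("ch"), in_range("A"), in_range("zz"))
```

In another locale the same call could accept or reject different
elements, which is the dependence on collating sequence that the
minutes point out.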
o Interp #29. Lines 188-191 of interp wrong if collating
elements.
o Interp #40. Having bracket expressions matching
collating elements causes problems since ([[:alpha:]])*
won't match the collating element [.ch.]. Some
discussion of saying that bracket expressions that
use [.x.] would match collating elements and
those that do not would match characters. Handling
equivalence classes would be tricky. Could have [=x=]
imply collating elements, or use [[=x=][.x.]] to match
collating elements, or add [.=x=.] to match collating
elements.
On page 52, Section 2.5.2.2 of P1003.2, line 1668 it
says that the two strings are first broken up into
collating elements, but doesn't explain how this
happens. Add somewhere in 2.5.2.2, "The first collating
element of a string is the largest possible collating
element."
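The proposed sentence can be sketched as a longest-first split.
The set MULTI of multi-character collating elements and the
function name are hypothetical, and applying the rule repeatedly
to the remainder of the string is an assumption; the proposed
wording only speaks of the first element.

```python
# Hypothetical multi-character collating elements for some locale.
MULTI = {"ch", "ll"}

def split_collating_elements(s, multi=MULTI):
    """Break s into collating elements, taking the largest one first."""
    longest = max((len(e) for e in multi), default=1)
    out, i = [], 0
    while i < len(s):
        # Try the longest candidate first; fall back to one character.
        for n in range(min(longest, len(s) - i), 0, -1):
            piece = s[i:i + n]
            if n == 1 or piece in multi:
                out.append(piece)
                i += n
                break
    return out

print(split_collating_elements("chello"))
```

With these assumed elements, "chello" splits into "ch", "e",
"ll", "o".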
o Interp #41. Dot1 needs to add interfaces for getting
next collating element from a string, getting
equivalence classes, and getting names of collating
elements.
o Interp #41. Part 8 is answered on page 729, section
B.5.2 lines 367-368.
o Interp #41. On page 81, Section 2.8.3.2 of P1003.2,
lines 2961-2963, change characters to collating
elements.
David Korn
=====================cut here===========================
David Korn
research!dgk
yyy@xxxxxxxxxxxxxxxx