I am the chair for the new regular expression subgroup of
the Austin group. Our charter is to resolve three aardvarks
that have been submitted to correct defects in the current
definition of regular expressions and to the regcomp()/regexec()
interface in the standard.
I want to start with ERN 17, which I have included below. It
deals with issues that were debated and resolved at a meeting of
regular expression experts held in Toronto in 1995 but were
never merged into the current standard. I don't expect
many controversial issues here, but we need to
make sure that the text is correct and unambiguous.
The issues with ERN 18 and ERN 19 are interrelated and offer
contradictory views and therefore should be discussed simultaneously
after we have completed ERN 17.
One useful suggestion that I have received from Glenn Fowler
is to have an open source test harness that all conforming
implementations can use to verify every example that
we discuss or that is in the standard. I will ask Glenn to
present his proposal to this group.
_____________________________________________________________________________
OBJECTION Enhancement Request Number 17
yyyyyy@xxxxxxxxxxx Defect in XBD Regular Expressions (rdvk# 43)
{eggert20020430a} Wed, 1 May 2002 00:17:32 +0100 (BST)
_____________________________________________________________________________
Accept_____ Accept as marked below_X___ Duplicate_____ Reject_____
Rationale for rejected or partial changes:
This is being deferred to a new subgroup.
_____________________________________________________________________________
Page: 167 Line: 5889 Section: Regular
Problem:
Defect code : 3. Clarification required
The Regular Expressions section
<http://www.opengroup.org/onlinepubs/007904975/basedefs/xbd_chap09.html#tag_09>
does not reflect the resolutions of the June 1995 POSIX RE experts
meeting as reported by David Korn in
<http://www.opengroup.org/sophocles/show_mail.tpl?source=L&listname=austin-group-l&id=3713>
Action:
Adopt the interpretations of the June 1995 POSIX RE experts meeting as
referenced above, with the exceptions of the I18N interpretations
(i.e., those interpretations starting with Interp #41, Part 7 and
continuing to the end of the meeting notes). The I18N interpretations
are now somewhat obsolete, since that part of the standard was changed
in POSIX 1003.1-2001. However, the other interpretations are to parts
of the standard that have not changed, so they are still relevant.
[Ed recommendation: None]
(From X-Mailing-List: austin-group-l:archive/latest/3971)
In the copy below, I have edited the page, section, and line numbers
so that they refer to the 2001 final text, and (as I mentioned
earlier) I am omitting the I18N-related actions. Also, for ease of
reference I am placing an "ACTION:" annotation at the start of each
recommended action. These are the only changes that I made.
o Interp #43, Part 15. To which match is a backreference
to a duplicated subexpression bound?
ACTION:
To resolve Interp #43, Part 15, add on XBD page 172, Section
9.3.6(3) after line 6109, "When a referenced
subexpression does not match any string (not even the
empty string), the backreference expression fails to
match. When subexpressions are nested, the substrings
matching them are similarly nested. When a contained
subexpression fails to participate in the last match of
its containing subexpression, backreferences to the
contained subexpression fail to match."
For example
\(a*\(b\)*\)\{2\}\2
fails to match
ba
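The example above can be checked against a local implementation
with a short C program. This is only a sketch: bre_matches() and
check_interp43_part15() are hypothetical names chosen for
illustration, and the line printed depends on the regcomp()/regexec()
implementation under test. Under the proposed wording the expected
result is "no match".

```c
#include <assert.h>
#include <regex.h>
#include <stdio.h>

/* Hypothetical helper: returns 1 on match, 0 on no match, and -1 if
 * the pattern fails to compile or regexec() reports an error. */
static int bre_matches(const char *pattern, const char *subject)
{
    regex_t re;
    int rc;

    assert(pattern != NULL && subject != NULL);
    if (regcomp(&re, pattern, 0) != 0)      /* cflags 0 selects BRE syntax */
        return -1;
    rc = regexec(&re, subject, 0, NULL, 0); /* only match/no-match is needed */
    regfree(&re);
    if (rc == 0)
        return 1;
    return rc == REG_NOMATCH ? 0 : -1;
}

/* Prints what the local implementation does for the example above. */
static int check_interp43_part15(void)
{
    int r = bre_matches("\\(a*\\(b\\)*\\)\\{2\\}\\2", "ba");

    printf("%s\n", r == 0 ? "no match (as the proposed text requires)"
                          : r == 1 ? "match" : "error");
    return r;
}
```

Calling check_interp43_part15() from main() prints one line
describing the behavior of the library that the program is linked
against.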
o Interp #43, Part 12. Can a duplicated subexpression
match the null string? If so, will the duplication be
repeated until the expression does match the null
string? There was a consensus that applying a
duplicator to a RE that could match the empty string
should be unspecified. However, if specified, the
specification in P1003.1-2001 is incorrect and should
be changed.
ACTION:
On XBD page 172, Section 9.3.6, change line 6127 to, "When a
subexpression or a backreference is repeated by an
asterisk(*) or an interval expression, the
subexpression shall not match a null".
Also, change XBD page 175 section 9.4.6 lines 6239-6241
to, "When an ERE enclosed in parentheses is repeated by
a *, ?, +, or an interval expression, the ERE enclosed
in parentheses shall not match the empty string unless
it is necessary to satisfy the exact or minimum number
of occurrences for the + or interval expression."
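One way to observe how an implementation treats an empty iteration
is to look at the offsets it reports for the repeated subexpression.
The following is a sketch under stated assumptions: ere_exec2() and
report_empty_iteration() are hypothetical names, and the offsets
reported for subexpression 1 may differ between implementations,
which is the point of the exercise.

```c
#include <assert.h>
#include <regex.h>
#include <stdio.h>

/* Hypothetical helper: runs an ERE and stores the whole match in m[0]
 * and subexpression 1 in m[1]. Returns 0 on match, nonzero otherwise. */
static int ere_exec2(const char *pattern, const char *subject, regmatch_t m[2])
{
    regex_t re;
    int rc;

    assert(pattern != NULL && subject != NULL);
    rc = regcomp(&re, pattern, REG_EXTENDED);
    if (rc != 0)
        return rc;
    rc = regexec(&re, subject, 2, m, 0);
    regfree(&re);
    return rc;
}

/* Prints the offsets of the whole match and of subexpression 1. */
static void report_empty_iteration(const char *pattern, const char *subject)
{
    regmatch_t m[2];

    if (ere_exec2(pattern, subject, m) != 0) {
        printf("%s on \"%s\": no match or error\n", pattern, subject);
        return;
    }
    /* An rm_so of -1 means subexpression 1 did not participate. */
    printf("%s on \"%s\": match (%ld,%ld) \\1 (%ld,%ld)\n", pattern, subject,
           (long)m[0].rm_so, (long)m[0].rm_eo,
           (long)m[1].rm_so, (long)m[1].rm_eo);
}
```

For instance, report_empty_iteration("(a*)*", "b") and
report_empty_iteration("(a*)+", "b") show whether the enclosed ERE is
recorded as matching the empty string when no iteration is required
and when one iteration is required.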
o Interp #43, Part 14. What is meant by the left-to-right
order in a match on line 5908? For example,
with pattern
((..)*(.....)*)*
and string xxxxx, what should \1 be? Lines 5911 and
5908 would give contradictory answers.
ACTION:
To resolve Interp #43, Part 14, add on XBD page 167, Section
9.1 after sentence ending on line 5908,
"An enclosed subpattern is deemed to be to the right of
an enclosing pattern." On XBD page 172, Section 9.3.6,
change lines 6090-6091 to "The following rules, in
conjunction with the general requirements of Sections 9.1 and 9.2,
shall be used to construct BREs matching multiple
characters". On XBD page 172, Section 9.3.6(2),
line 6095 replace "whatever" with "any string". On XBD
page 175, Section 9.4.6, change lines 6205-6206
to "The following rules, in conjunction with the
general requirements of Sections 9.1 and 9.2, shall be used to
construct EREs matching multiple characters". On XBD page
175, Section 9.4.6(1), line 6209 replace
"whatever" with "any string".
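The question in Interp #43, Part 14 can be examined by printing the
subexpression offsets that an implementation reports for the pattern
and string given there. This is a sketch only: exec_part14(),
print_part14() and NSUB are hypothetical names, and no particular
output is claimed, since the values reported for \1, \2 and \3 are
exactly what is in question.

```c
#include <assert.h>
#include <regex.h>
#include <stdio.h>

#define NSUB 4   /* whole match plus the three subexpressions */

/* Hypothetical helper: fills m[0..NSUB-1]; returns 0 on match. */
static int exec_part14(const char *subject, regmatch_t m[NSUB])
{
    regex_t re;
    int rc;

    assert(subject != NULL);
    rc = regcomp(&re, "((..)*(.....)*)*", REG_EXTENDED);
    if (rc != 0)
        return rc;
    rc = regexec(&re, subject, NSUB, m, 0);
    regfree(&re);
    return rc;
}

/* Prints one line per subexpression; (-1,-1) means "did not participate". */
static void print_part14(const char *subject)
{
    regmatch_t m[NSUB];
    int i;

    if (exec_part14(subject, m) != 0) {
        puts("no match or error");
        return;
    }
    for (i = 0; i < NSUB; i++)
        printf("\\%d = (%ld,%ld)\n", i, (long)m[i].rm_so, (long)m[i].rm_eo);
}
```

Calling print_part14("xxxxx") lists the offsets for the whole match
(shown as \0) and for each of the three subexpressions.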
o Interp #44. There was unanimous agreement that the
error numbers must be unique.
o Interp #45. Current interp and change OK.
o Interp #60. This needs to be fixed in .1.
o Interp #73. Current interp and change OK.
o Interp #82. The current interpretation is incorrect.
In section 9.3.6(3), lines 6098-6099 the standard says
that a backreference matches a "string of characters".
Therefore, the standard requires that the expression
\(^b\)\1 must match the first two characters of bbbb.
o Interp #85. Agreement except that dumping core should
not be allowed for bad expressions.
ACTION:
Therefore, in XBD Section 9, on lines 5927, 5942, 5970, 5982, 6075,
6125, 6171, 6180, 6185, 6193, 6238, and 6468, "undefined" should be
changed to "unspecified".
o Interp #86. Current interp and change OK.
o Interp #88. Wording for Interp #43, Part 15 should
take care of this.
o Interp #125. What is the meaning of BRE\{0,0\}? The
current wording leaves the behavior unspecified.
ACTION:
On XBD page 172, Section 9.3.6(4), add after end of
sentence on line 6112, "Zero occurrences of a BRE match
the empty string". On XBD page 175, Section 9.4.6(3),
add after end of sentence on line 6219, "Zero
occurrences of an ERE match the empty string".
Note that this added sentence must also apply to parts (3),
(4) and (5).
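The effect of a {0,0} interval can be probed in both syntaxes with a
small table-driven check. This is a sketch, not a conformance test:
re_matches() and report_zero_interval() are hypothetical names, and
because the current wording leaves the behavior unspecified, an
implementation may report something other than what the proposed
text would require (a match on "b" and no match on "ab").

```c
#include <assert.h>
#include <regex.h>
#include <stdio.h>

/* Hypothetical helper: 1 = match, 0 = no match, -1 = compile or
 * execution error. cflags is 0 for a BRE, REG_EXTENDED for an ERE. */
static int re_matches(const char *pattern, const char *subject, int cflags)
{
    regex_t re;
    int rc;

    assert(pattern != NULL && subject != NULL);
    if (regcomp(&re, pattern, cflags | REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, subject, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : (rc == REG_NOMATCH ? 0 : -1);
}

/* Prints the local result for a {0,0} interval in BRE and ERE syntax. */
static void report_zero_interval(void)
{
    static const struct { const char *pat; int cflags; } cases[] = {
        { "^a\\{0,0\\}b$", 0 },
        { "^a{0,0}b$", REG_EXTENDED },
    };
    static const char *subjects[] = { "b", "ab" };
    size_t i, j;

    for (i = 0; i < sizeof cases / sizeof cases[0]; i++)
        for (j = 0; j < sizeof subjects / sizeof subjects[0]; j++)
            printf("%s on \"%s\": %d\n", cases[i].pat, subjects[j],
                   re_matches(cases[i].pat, subjects[j], cases[i].cflags));
}
```

The patterns are anchored with ^ and $ so that a match on "ab" cannot
be produced simply by skipping the leading "a".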
o Doug #6. Does case folding apply to backreferences?
ACTION:
On XBD page 172, Section 9.3.6(3), add after line
6109, "When pattern matching is being performed without
regard to case, the backreference match will occur
without regard to case."
ACTION:
Also, on XBD page 170, Section 9.3.5(3), add after
end of sentence on line 6022, "Whenever pattern
matching is being performed without regard to case,
each character or collating element shall be deemed to
stand for itself and all its case counterparts." On
XBD page 168, Section 9.2, line 5954 change
"counterpart" to "counterparts".
It wasn't clear whether there are such things as uppercase
and lowercase multi-character collating elements.
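The backreference question in Doug #6 can be tried with REG_ICASE.
This is a minimal sketch: backref_matches() and
report_icase_backref() are hypothetical names, and the pattern
^\(a\)\1$ is an illustration rather than an example taken from the
meeting notes. Under the proposed wording, the string "aA" would be
expected to match when case is ignored.

```c
#include <assert.h>
#include <regex.h>
#include <stdio.h>

/* Hypothetical helper: matches the BRE ^\(a\)\1$ against subject,
 * optionally ignoring case. 1 = match, 0 = no match, -1 = error. */
static int backref_matches(const char *subject, int ignore_case)
{
    regex_t re;
    int rc;

    assert(subject != NULL);
    if (regcomp(&re, "^\\(a\\)\\1$", ignore_case ? REG_ICASE : 0) != 0)
        return -1;
    rc = regexec(&re, subject, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : (rc == REG_NOMATCH ? 0 : -1);
}

/* Prints the local result with and without case folding. */
static void report_icase_backref(void)
{
    printf("\"aA\" case-sensitive:   %d\n", backref_matches("aA", 0));
    printf("\"aA\" case-insensitive: %d\n", backref_matches("aA", 1));
}
```

The case-sensitive call serves as a control: the backreference must
repeat "a" exactly, so "aA" cannot match there.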
_____________________________________________________________________________
David Korn
research!dgk
yyy@xxxxxxxxxxxxxxxx