Email List: Xaustin-group-lX

Re: XBD ERN 7 -- regcomp and sed

To: yyyyyyyyyyyyyy@xxxxxxxxxxxxx
Subject: Re: XBD ERN 7 -- regcomp and sed
From: Glenn Fowler <yyy@xxxxxxxxxxxxxxxx>
Date: Tue, 25 May 2004 12:53:40 -0400 (EDT)
Organization: AT&T Labs Research
References: <1085077669.5259.440.camel@collie>
> Geoff Clare wrote:
> I spotted a couple of minor things in the new text:

> 1. The phrase "back-reference expressions to the contained subexpression"
> seems a bit odd.  I'm guessing it was first drafted as "back-references
> to ..." and then "back-references" got changed to "back-reference
> expressions".  Maybe "back-reference expressions corresponding to the
> contained subexpression" would work.

Thanks. This was the toughest part to translate (from the regexec()
subexpression[i]/subexpression[j] text).

> 2. The last example is wrong.  The expression "\(ab*\)*\1" does
> match 'ababbab' - it matches the first four characters (i.e. the
> subexpression matches 'ab' and then \1 matches 'ab').  The example
> would work okay with an anchored RE.

I took a shortcut (i.e., didn't try it) on the last one and paid for it.

Here is the revised text:

  The string matched by a contained subexpression shall be within the
  string matched by the containing subexpression.  If the containing
  subexpression does not match, or if there is no match for the contained
  subexpression within the string matched by the containing subexpression,
  then back-reference expressions corresponding to the contained
  subexpression shall not match.  When a subexpression matches more than
  one string, a back-reference expression corresponding to the
  subexpression shall refer to the last matched string.  For example, the
  expression "^\(.*\)\1$" matches lines consisting of two adjacent
  appearances of the same string, the expression "\(a\)*\1" fails to
  match 'a', the expression "\(a\(b\)*\)*\2" fails to match 'abab', and
  the expression "^\(ab*\)*\1$" matches 'ababbabb' but fails to match
  'ababbab'.
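
For anyone who wants to try some of these examples quickly, here is a rough
sketch in Python. It rests on assumptions: Python's `re` module is not a POSIX
BRE engine, so the patterns are rewritten with ( ) in place of \( \), and the
helper name `matches` is made up for illustration. The third example is left
out on purpose, since backtracking engines may keep an inner capture from an
earlier iteration and so may not follow the rule stated above.

```python
import re

# Examples from the revised text, with \( \) rewritten as ( ) because
# Python's re uses ERE-like grouping syntax. The third field is the
# outcome the revised text states, not something re guarantees for BREs.
cases = [
    (r"^(.*)\1$", "abcabc", True),       # two adjacent copies of "abc"
    (r"(a)*\1", "a", False),             # \1 has nothing left to match
    (r"^(ab*)*\1$", "ababbabb", True),   # \1 refers to the last "abb"
    (r"^(ab*)*\1$", "ababbab", False),
]


def matches(pattern, text):
    """Return True if pattern matches anywhere in text (hypothetical helper)."""
    return re.search(pattern, text) is not None


for pattern, text, stated in cases:
    print(pattern, repr(text), "matches" if matches(pattern, text) else "no match",
          "(text says: %s)" % ("matches" if stated else "no match"))
```

A real check of the wording would call regcomp()/regexec() on the original
BREs; this only shows the intent of the examples.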

Also, I ran ERN-7 by Doug McIlroy and he noted that the sed substitute
command description only specifies what is substituted when a backreference
expression refers to a subexpression that matches:

  The characters "\n", where n is a digit, shall be replaced by the
  text matched by the corresponding backreference expression.

This should probably be revised to handle cases where the backreference
expression does not match:

  The characters "\n", where n is a digit, shall be replaced by the
  text matched by the corresponding backreference expression, or by
  the empty string if the corresponding backreference expression
  does not match.

This seems to align with sed behavior.
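
The proposed rule can be sketched as a small replacement expander. This is
only an illustration under assumptions: Python's `re` stands in for the regex
engine (it is not sed and not a POSIX BRE implementation), the function name
`expand` is hypothetical, and "\n" is taken to mean \1 through \9.

```python
import re


def expand(replacement, match):
    # Replace each \1..\9 in `replacement` with the text of the
    # corresponding subexpression, or with the empty string when that
    # subexpression did not take part in the match (the proposed rule).
    def sub_ref(ref):
        text = match.group(int(ref.group(1)))
        return "" if text is None else text

    return re.sub(r"\\([1-9])", sub_ref, replacement)


# (a)* matches zero times against "b", so subexpression 1 is unmatched.
print(expand(r"[\1]", re.search(r"(a)*b", "b")))    # prints "[]"
print(expand(r"[\1]", re.search(r"(a)*b", "ab")))   # prints "[a]"
```

The only design point is the `None` check: an unmatched subexpression
contributes nothing to the output rather than being treated as an error.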

-- Glenn Fowler -- AT&T Labs Research, Florham Park NJ --
