Email List: Xaustin-group-lX
[All Lists]

regxp algebra (was Re: RE-ASSOC)

To: yyyyyy@xxxxxxxxxxx
Subject: regxp algebra (was Re: RE-ASSOC)
From: Tom Lord <yyyy@xxxxxxx>
Date: Tue, 30 Apr 2002 12:59:59 -0700 (PDT)
Cc: yyy@xxxxxxxxxxxxxxxx, yyyyyyyyyyyyyy@xxxxxxxxxxxxx, yyyyy@xxxxxxxxxxxxxx, yyyyy@xxxxxxxxxx
References: <200204091857.OAA35259@raptor.research.att.com> <200204092229.g39MTeX00802@shade.twinsun.com> <200204121008.DAA20963@morrowfield.home> <200204290716.DAA01769@raptor.research.att.com> <200204291846.g3TIkB109921@shade.twinsun.com> <200204300313.XAA78654@raptor.research.att.com> <200204300627.g3U6RV809501@sic.twinsun.com>
I have two messages that, taken together, make the case that left
associativity makes much more sense than right associativity.

One message is an abstract look at regexp algebra.  The other is
a concrete and specific example that can plausibly come up in
the real world.

Here is the abstract look:

-t



** Regexp Algebra

  Speaking of associativity, let's look at some regexp algebra.

  Suppose I have a pattern `<A>' and a pattern `<B>' and a string `S'.

  Suppose there is a prefix of `S', call it `Sa', that is the longest 
  prefix matching `<A>'.  The rest of the string, the suffix, we can 
  call `Sb' and suppose that it matches `<B>'.

  We can name the substring positions in these matches.  When `<A>'
  matches `Sa', we get substring positions `pmatch_A_Sa'.  When `<B>'
  matches `Sb', we get substring positions `pmatch_B_Sb'.

  Suppose that `<A>' has `Na' parenthesized subexpressions, and that `<B>'
  has `Nb' parenthesized subexpressions.

  Now think of the pattern `(<A>)(<B>)'.  In both left and right
  associative worlds, we know that that pattern matches `S'.  That will give
  us `pmatch_AB_S'.

  In fact, we know that:

<<<
        pmatch_AB_S[0].rm_so == 0
        pmatch_AB_S[0].rm_eo == length(S)

        for X in 1 .. Na+1

                pmatch_AB_S[X] == pmatch_A_Sa[X-1]

        for X in Na+2 .. Na+2+Nb

                if pmatch_B_Sb[X-(Na+2)] != -1 

                  pmatch_AB_S[X] == pmatch_B_Sb[X-(Na+2)] - length(Sa)

                otherwise

                  pmatch_AB_S[X] == {-1, -1}
>>>

  That's very handy and is equally true in left and right associative
  worlds. 

  However, parentheses are an expensive operator and they interfere with
  the usefulness of backreferences.  Optimizations that can eliminate
  parentheses are worth knowing about.  Some special cases can be used
  to eliminate parentheses!

*** Special Case #1 in the Left Associative World

  Suppose that the lowest precendence operator in `<A>' is _not_ `|'.

  Then an invariant like the one above (but with slightly different
  pmatch index offsets) applies to the pattern `<A>(<B>)'.  One
  less set of parentheses!

  Note that the pattern produced by this special case, `<A>(<B>)' is
  itself a pattern in which the lowest precendence operator is not
  `|',  so we can apply this special case rule to that pattern again,
  giving us `<A>(<B>)(<C>)', then `<A>(<B>)(<C>)(<D>)', etc.  Very nice.


*** Special Case #1 in the Right Associative World

  But, aha!  That's nothing special.  Suppose that this time the 
  lowest precendence operator in `<B>' is not `|'.  Then in the
  right associative world, the similar invariant applies to
  `(<A>)<B>'.  Again, one less set of parentheses than in the general
  rule!

  Note that in _this_ special case, we didn't require anything special
  of pattern `<A>'.  If we have `<C>', `<D>', etc. in which the
  lowest precendence operator is not `|', then we can again use this
  rule repeatedly to build up `((<A>)<B>)<C>', then
  `(((<A>)<B>)<C>)<D>', etc.

  Isn't that just as nice?  It's a little odd that the reason we can 
  repeat this rule came out differently -- but let's gloss over that
  for the moment.


*** So aren't Left and Right Associativity Simply Symmetric?

  So far, it looks like symmetry will kill off any advantage of
  left-associativity.  But wait -- I'll soon fix that.  Let's look at
  some special cases of the special cases.


*** Special case #2 in the Left Associative World

  As before, the lowest precendence operator in `<A>' is _not_ `|'.

  In addition, this time, assume that `<B>' is either a parenthesized
  expression, an expression matching a single character, or a 
  pattern followed by a duplication or option operator (`?, {n,m}, *, +').

  Now this time, invariants like the ones given above (again, changing
  the `pmatch' offsets slightly) hold for the pattern: `<A><B>'.

  Huzzah!  No new parens at all.  The resulting pattern is ready to be 
  fed back into this special-special case as the new `<A>', and we
  can keep adding simple expressions to build up `<A><B><C>',
  `<A><B><C><D>' etc.  [I've given a concrete example of how this
  is useful in the next message.]


*** Special case #2 in the Right Associative World.

  As before, the lowest precendence operator in `<B>' is _not_ `|'.

  In addition, this time, assume that `<A>' is either a parenthesized
  expression, an expression matching a single character, or a 
  pattern followed by a duplication or option operator (`?, {n,m}, *, +').

  Now this time, invariants like the ones given above (again, changing
  the `pmatch' offsets slightly) hold for the pattern: `<A><B>'.

  Hrumph!  No new parens at all.  The resulting pattern is a new
  pattern in which the lowest precendence operator is _not_ `|', so
  special case #2 applies to the new pattern in the role of `<B>': We
  can feed the new pattern back into this special-special case 
  and keep adding simple expressions to build up `<C><A><B>',
  `<D><C><A><B>' etc.


*** Not Quite Symmetric

  Opps!  The symmetry has clearly broken.  Special case #2 gives the
  left associative world a simple way to build up parentheses-less
  patterns from left to right.  In the right associative world, special
  case #2 gives us a simple way to build up parentheses-less patterns
  from right to left.  I think it's obvious that building up regexps
  from left to right is the activity we want to support [and the next
  message will make this clearer through a specific example].

  Really, this same bug was already present in the special case #1 for
  the right associative world -- that's what we "glossed over" -- but
  special case #2 is what makes the consequences of the bug starkly
  apparent.

  When we build up complex patterns from simple ones, we are more
  likely to be trying to extend a match to the right than to start it
  earlier in the string.  The left-or-right associativity question
  isn't an unimportant question at all: it relates in a signifcant
  way to the fact that the left part of a pattern matches the left
  part of a string, the right part of a pattern the right part of a
  string.  While the pattern algebras may be symmetric, the two ends
  of the string are "beginning" and "end" which are opposite things,
  not equivalent things.


*** Your Left-Leaning Friends are Right

  So, left associativity makes it easier for us to build up efficient
  regexps by extending existing regexps to the right.  In order to
  provide a more useful regexp algebra, concatenation is left
  associative.  If we compare `(wee|week)(night|knights)(s+)' to
  `weeknightssss', `\1' is `wee' and `\3' is `sss'.


<Prev in Thread] Current Thread [Next in Thread>