Email List: Xaustin-group-lX
[All Lists]

regxp algebra example (was Re: RE-ASSOC)

To: yyyyyy@xxxxxxxxxxx
Subject: regxp algebra example (was Re: RE-ASSOC)
From: Tom Lord <yyyy@xxxxxxx>
Date: Tue, 30 Apr 2002 13:01:48 -0700 (PDT)
Cc: yyy@xxxxxxxxxxxxxxxx, yyyyyyyyyyyyyy@xxxxxxxxxxxxx, yyyyy@xxxxxxxxxxxxxx, yyyyy@xxxxxxxxxx
References: <200204091857.OAA35259@raptor.research.att.com> <200204092229.g39MTeX00802@shade.twinsun.com> <200204121008.DAA20963@morrowfield.home> <200204290716.DAA01769@raptor.research.att.com> <200204291846.g3TIkB109921@shade.twinsun.com> <200204300313.XAA78654@raptor.research.att.com> <200204300627.g3U6RV809501@sic.twinsun.com>

Let's consider the problem of block comments in the mythical
programming language "xyzzy".  The "xyzzy" comment syntax is similar
to that of shells, or lisp dialects, or the "//" syntax of C++.

A "block comment" in "xyzzy" consists of one or more lines, each
beginning with optional whitespace, followed by ":", followed by
arbitrary characters.


   : This is a 
   :       three line
   :               block comment


People sometimes use punctuation charcters to decorate their block
comments.

   : **********************************
   : **    critical section          **
   : **                              **
   : ** Interrupts must be masked.   **
   : **********************************


For a particular programming project, the developer's decide
on the convention of using "tagged block comments".  A tagged
blocked comment is a block comment in which the first line
that does not contain only punctuation starts with optional
punctuation, followed by a "comment keyword", followed by either a
newline or mandatory comment decoration.

A "comment keyword" is any of a list of strings, defined in some file
somewhere, with the only restrictions being that strings may not
contain newlines, and must contain at least one character which is
neither a space character or punctuation.

So if the comment keywords are:

   critical
   fix-this
   invariant

then one example block comment is:

   : invariant - `homedir' is either nil or a string

another is:

   : ********************************
   : *          invariant           *
   : *                              *
   : * `homedir' is either nil or   *
   : * a string.                    *
   : ********************************


Given a string which contains a complete source file, the challenge is
to write a regexp that finds all block comments, and that can be used
to detect which block comments are tagged comments, and which are not.
(Perhaps my program is supposed to generate documentation for the
source code based on the comments.)  We can assume for simplicity that
the last line of the string definately ends with a newline.

Such a pattern can be built up in small pieces.

   <comment-start> := "^[[:blank:]]*:"

That regexp finds the first line of all block comments.

   <comment-decoration> := "[[:punct:][:blank:]]*"

That regexp matches a decoration part of a comment line.

Let's assume we have a regexp that can match the comment tags:

   <comment-tag>   :=  ...an exact match for any comment tag strings ...

A block comment can start with 0 or more lines that are purely
decorative.  Each of these lines matches:

   <decoration-line> := <comment-start><comment-decoration>*\n

So:

   <comment-prefix-1> := (<decoration-line>)*

If the comment continues, the next line can begin with some 
decoration (I've split the regexp over multiple lines to make it 
easier to read):

   <comment-prefix-2> := (<decoration-line>)*
                         <comment-start>
                         <comment-decoration>*

After that decoration, there is an optional comment tag string:

   <comment-prefix-3> := (<decoration-line>)*
                         <comment-start>
                         <comment-decoration>*
                         (<comment-tag>$|<comment-tag><comment-decoration>+)?

That's handy.  That regexp will match the first part of all blocked
comments.  If the parenthesized "comment-tag" part matches anything at
all, then the blocked comment is a tagged comment.  The user could
usefully test this part of the regexp out in isolation, even.

The rest of the first non-decoration line is arbitrary:

   <comment-prefix-3> := (<decoration-line>)*
                         <comment-start>
                         <comment-decoration>*
                         (<comment-tag>$|<comment-tag><comment-decoration>+)?
                         .*

and is followed by a newline:

   <comment-prefix-4> := (<decoration-line>)*
                         <comment-start>
                         <comment-decoration>*
                         (<comment-tag>$|<comment-tag><comment-decoration>+)?
                         .*
                         \n

And, the rest of the block comment is any subsequent lines that are 
block comment lines:

   <block-comments>   := (<decoration-line>)*
                         <comment-start>
                         <comment-decoration>*
                         (<comment-tag>$|<comment-tag><comment-decoration>+)?
                         .*
                         \n
                         (<comment-start>.*\n)*

So, did we build a regexp that reliably finds all comments, and
reliably detects tagged comments?  In a left associative world, I'm
confident that we did, just because of the way we built the regexp.

At each step, I had one regexp (call it /<A>/) in which the lowest
precedence operator was not '|'.  That regexp matched a prefix of the
total string I wanted to eventually match, but nothig beyond that
prefix.

At each step I also had a second regexp (call it /<B>/) that was
either a single character pattern, a parenthesized pattern, or a
pattern whose lowest precedence operator was one of ?, +, *, or {}.
This second regexp matches part of the string beyond the prefix that
matches /<A>/.

In the left associative world, I know that in general I can write the
pattern /<A><B>/, that /<A>/ will continue to match the prefix in
exactly the same way, and that /<B>/ will extend the match.  I don't
even care about the details of the /<comment-tag>/ regexp, beyond the
restrictions stated in the problem statement (i.e. it can not contain
a newline or consist of nothing but comment decoration).

So in a left associative world, I can be confident that my regexp works.

In a right-associative world, my regexp is broken.  Suppose, for
example, that the tag string list is:

   critical
   invariant
   *FIX*

and I have the comment:


   : ********************************
   : *          *FIX*               *
   : *                              *
   : * this doesn't handle the case *
   : * that `homedir' is nil        *
   : ********************************


The first line of the block-comment pattern:

   <block-comments>   := (<decoration-line>)*

will match the first line of the comment:

   : ********************************

The next two lines of the pattern:

                         <comment-start>
                         <comment-decoration>*

will match part of the second line:

   : *          *

The fourth line of the pattern:

                         (<comment-tag>$|<comment-tag><comment-decoration>+)?

will match the empty string.  The parenthesized part won't match
anything at all, so I won't detect that this is a tagged comment.

The rest of the pattern will match the rest of the block comment.

In a right associative world I could ``fix'' the pattern by adding 
extra parentheses, though I'd have to curse under my breath the 
specification that made regexp concatenation right associative.

-t

<Prev in Thread] Current Thread [Next in Thread>