In message <037901c1e00b$846aac10$yyyyyyyy@xxxxxxxxxx>, "Nick Stoughton" writes
:
>This from Andrew Hume (yyyyyy@xxxxxxxxxx):
...
>the above analysis is simply wrong. a recursive analysis is
>/((week|wee)|(night|knights))(s*)/ matches weeknights.
>the left subexpression ((we ... ts)) can match weeknights, so it must
>(longest match),
> and thus (s*) is the empty string.
>recursing, the longest match for (week|wee) satisfying overall match is wee,
>which yields (night|knights) matching knights.
...
>(2) is wrong. it has to be the longest match; end of argument. what
>"left to right"
>means ha sto do with what REs actually mean, not how they are implemented.
>recall that a RE generates a (possibly infinite) set of strings. when we
>say that
>an RE "matches" a string, we mean that that string is a member of the
>set of strings
>generated by that RE. so how does
If this is the case, then the wording in the standard should be rewritten
and have a lot of (better) disambiguating examples. The current wording
and examples, with its references to subpatterns, implies locally maximal
matching as opposed to globally maximal matching. I shouldn't have jumped
into this thread midway because I hadn't appreciated that there had existed
an intent contradictory to the wording that as was so comprehensivley
explained by Paul Eggert's original post. But don't the minutes
David Korn just posted contradict Andrew Hume's statement of original
intent? Traditionally, Andrew is right about leftmost longest if
you're talking about parse tree and derivations (first and follow
sets, etc.) which will give you the leftmost of the longest matches
(i.e., a longest match period). So yes, the term leftmost longest is
misused by those with an NFA mindset, but the examples in the standard
and the minutes support that misuse, specifically by requiring the
longest of the leftmost matches for each subpattern (not the same as
the leftmost of the longest).
David Korn writes:
>For ((week|wee)(night|knights))(s*)
>
>The subexpressions are
>1. ((week|wee)(night|knights))
>2. (week|wee)
>3. (night|knights)
>4. (s*)
...
> \1 would be weekknights
> \4 empty
>Then, \2 and \3 would be computed by looking at all the ways
>that
> ((week|wee)(night|knights))
>could match weeknights and choosing the leftmost longest of (week|wee).
>However, the only match is
> wee knights
>so that \2 is wee and \3 is knights.
>
>I hope that the will clarify the intent at least.
OK, if that's the intent, it's consistent with what Andrew said and with
Paul's "aaaa" example. Quick, set it in stone before it's forgetten! I was
doing better before I read the standard and took it on faith from
somewhere long forgotten that POSIX regexps matched maximally. Then this
thread made me look at the standard and I got the impression "Wait a minute,
this is like Perl." Should've kept my mouth shut, but at least the
experience provides additional evidence that greater clarity is
needed in the standard.
daniel
|