eggert:
But I must start off thinking 'parse tree', since that's how
the POSIX standard defines ERE syntax.
I agree. If necessary, we can argue that at length. However, gsf is
only trying to show that the standard is ambiguous and _might_ lead to
right-associative semantics for concatenation.
Let me play devils advocate, for the moment, and argue his position in
the way that convinced me not to spend a lot of time arguing against
the possibility that the spec is ambiguous, if I can avoid it. (This
is my penance for orginally labeling his reading "absurd". :-)
`subpattern', in the matching rule, is ordinary english. It just
means any kind of constituent part of a larger pattern. So, what is a
"constituent part"? It might be, as lord and eggert maintain, a part
generated by any non-terminal in the grammar. _Or_, it might refer to
the construction of complex regexps given in the text.
In the text, EREs and BREs matching a single character are defined.
Then rules are given to build up more complex patterns from those.
For example, you can add `+' or `?' after an ERE matching a single
character or an ERE enclosed in parentheses. Another example: you can
(subject to the operator precedence rules) form a new ERE by
concatenating several other EREs.
So, in terms of the construction of the text, here is a definition for
`nth_subpattern':
function nth_subpattern (pattern, n)
{
if pattern is a parenthesized subexpression
return the pattern that is enclosed in parens
if the lowest precedence operator in pattern is *,+,{}, or ?
return the pattern with that operator removed
if the lowest precedence operator in pattern is |
treat | as a variable arity operator
return the nth branch
if the lowest precedence operator in pattern is concatenation
treat concatenation as a variable arity operator
return the nth concatenated element
}
Some examples:
nth_subpattern(/ab*c/, 0) => /a/
nth_subpattern(/ab*c/, 1) => /b*/
nth_subpattern(/ab*c/, 2) => /c/
nth_subpattern(/foo|foobar|baz/, 0) => /foo/
nth_subpattern(/foo|foobar|baz/, 1) => /foobar/
nth_subpattern(/foo|foobar|baz/, 2) => /baz/
nth_subpattern(/b*/, 0) => b
nth_subpattern(/[abc]/, 0) => <undefined>
Lord and eggert use an alternative definition for `nth_subpattern',
based on the grammar:
function nth_subpattern (pattern, n)
{
if pattern is a parenthesized subexpression
return the pattern that is enclosed in parens
if the lowest precedence operator in pattern is *,+,{}, or ?
return the pattern with that operator removed
if the lowest precedence operator in pattern is |
treat | as a left-associative binary operator
return the nth (0 or 1) branch
if the lowest precedence operator in pattern is concatenation
treat concatenation as a left-associative binary operator
return the nth (0 or 1) concatenated element
}
With examples:
nth_subpattern(/ab*c/, 0) => /ab*/
nth_subpattern(/ab*c/, 1) => /c/
nth_subpattern(/foo|foobar|baz/, 0) => /foo|foobar/
nth_subpattern(/foo|foobar|baz/, 1) => /baz/
nth_subpattern(/b*/, 0) => b
nth_subpattern(/[abc]/, 0) => <undefined>
Now, taking off my devil's advocate costume -- I think there are good
arguments that the grammar-based definition of `nth_subpattern' is
the only right reading of the spec. Unfortunately, the arguments are
based on fairly subtle wordplay.
So, what happens if we stipulate that the spec is ambiguous about
whether or not `|' and concatenation are binary operators or variable
arity? Then, as a consequence, we've stipulated that the spec is
ambiguous about whether or not concatenation and `|' have left or
right associative semantics.
When interpreting an existing standard, I think we have a set of
constraints to satisfy. These are:
================================
higest priority constraints
(but unordered among themselves)
A. What the standard says.
B. What historic implementations do,
perhaps giving some weight to the
conforming vendor implementations
and BSD.
C. What the standard authors clearly
meant.
D. What the historic implementors
clearly meant.
================================
low priority constraint
(use only to resolve ties)
E. What makes the most sense.
================================
Let's examine those constraints:
A. What the standard says.
We can argue forever about what the standard says, or
stipulate that it is ambigous.
For those of us who believe left-associativity is correct,
let's be generous and stipulate for the moment that the
spec is ambiguous.
B. What historic implementors do.
Libc `regexec' implementations are full of all kinds of bugs. They
aren't reliable guides at all.
Many open source programs, for example, avoid using `regexec'
at all -- presumably because implementations vary so wildly --
and instead include their own regexp implementations.
It's significant that the conformance test suites are very weak
regarding regexps.
In short, it doesn't look to me like there is much if any
application code in the world that depends on the associativity
question going one way or the other.
C. What the standard authors clearly meant.
The authors are silent on this issue. Their intent
does not clearly extend to this issue. It is quite
plausible that they did not realize associativity of
concatenation matters. In that case, they can not be
said to have had any intent on this issue at all.
It's significant that you can assume various definitions
of "subpattern" that _implicitly_ decide the associativity
of concatenation. Then you can give another definition
which is equivalent in every respect except that it
_explicitly_ decides associativity in the opposite way.
It's significant that _none_ of the example regexps in the
spec are sensative to the associativity issue.
So if you stated or assumed a definition of "subpattern",
but didn't _explicitly_ consider associativity, then
we can't look to your definition to answer any questions about
associativity: your intent might have decided associativity
simply by accident.
D. What the historic implementors clearly meant.
The implementors are largely silent on this issue.
(Am I the sole exception? I made Rx left associative a few years
ago as a deliberate choice, after studying the standard.)
When giving weight to what a historic implementation does
wrt. to associativity, the comments above for constraint
(C) apply here as well: any decision an implementor made
regarding associativity might very well have been an accidental
one.
E. What makes the most sense.
We can argue forever about constraints A..D, or we can pick the
answer that makes the most sense. Constraint E is there for
just such cases where A..D can't resolve the issue.
For the associativity question, I know of no answer that makes more
sense than to give regexps the useful algebraic properties that
follow from the left associativity of concatenation. More on
this in the next message.
|