| To: | Austin Group <yyyyyyyyyyyyyy@xxxxxxxxxxxxx> |
|---|---|
| Subject: | RE-ASSOC: a question about the associativity of RE concatenation |
| From: | Mark Brown <yyyyy@xxxxxxxxxx> |
| Date: | Fri, 5 Apr 2002 15:37:31 -0600 |
| Cc: | yyyyy@xxxxxxxxxxxxxx, "Shoji Sugiyama" <yyyyy@xxxxxxxxxx>, yyyy@xxxxxxx, Paul Eggert <yyyyyy@xxxxxxxxxxx> |
|
RE-ASSOC: a question about the associativity of RE concatenation [Some of the people in the GNU projects (Paul Eggert, Tom Lord, Isamu Hasegawa, myself) were considering consolidating the various regex implementations in GNU, and came up with a series of questions concerning the Specification. This is one of those questions.] Must pattern matching strictly follow the left-associative POSIX grammar for regular expressions, or is pattern matching allowed to behave as if the grammar had been right-associative? This question follows up the earlier question RE-CONCAT; please see the discussion of the earlier question for much of the background behind this question. The POSIX grammar for EREs contains this production: ERE_branch : ERE_expression | ERE_branch ERE_expression <http://www.opengroup.org/onlinepubs/007904975/basedefs/xbd_chap09.html#tag_09_05_03> This grammar is left-associative, which means that the ERE /ABC/ is parsed as if it were /(AB)C/, aside from parenthesis numbering. In the REs of classical formal language theory developed by Stephen Kleene and others, the concatenation of REs is associative, in the sense that the REs /(AB)C/ and /A(BC)/ describe the same regular sets. But POSIX REs might have different possible interpretations depending on how one associates subpatterns to matched strings. For example, suppose we match the ERE /(week|wee)(night|knights)(s*)/ to the string "weeknights", and want to know which subpatterns match which substrings. Here are some possible interpretations: (1a) The "subpatterns" are the two immediate subexpressions, interpreted according to the grammar, namely /(week|wee)(night|knights)/ and /(s*). The longest consistent match for /(week|wee)(night|knights)/ is "weeknights". Therefore, /(s*)/ matches the empty string. (1b) Even though the grammar is left-associative, the matching semantics are right-associative. The "subpatterns" are the two subexpressions /(week|wee)/ and /(night|knights)(s*)/. The longest consistent match for /(week|wee)/ is "week". We then apply the rule recursively, and find that the longest consistent match for /(night|knights)/ against "nights" is "night". Therefore, /(s*)/ match "s". (2) The "subpatterns" are all the subexpressions, not just the concatenated subexpressions (see interpretation (2) of question RE-CONCAT), so the matching semantics are inherently right-associative. This returns the same match as (1b), but for the slightly different ERE /(wee|week)(night|knights)(s*)/, (1b) still says /(s*)/ matches "s" but (2) says /(s*)/ matches the empty string. (3) The standard is ambiguous, and some or all the above interpretations are valid. Perhaps other interpretations are valid as well. A conforming implementation might even use different interpretations at different times. (1a) and (1b) are both consistent with interpretation (1) of question RE-CONCAT. Note that a user can override either (1a) or (1b) by using explicit parenthesization. (2) is consistent with RE-CONCAT interpretations (2); under (2), explicit parenthesization is irrelevant. Here are some arguments for and against these interpretations. These lengthy arguments are briefly summarized at the end of this section. pro (1a): The most natural interpretation of the standard and rationale is that the implementation must recurse through the regular expression's parse tree, as indicated by the grammar. The ERE must be parsed strictly according to the grammar, because the standard says that the ERE "grammar takes precedence over the text". <http://www.opengroup.org/onlinepubs/007904975/basedefs/xbd_chap09.html#tag_09_05> con (1a): (1a) is counterintuitive, because it says that when matching /XYZ/, it is more important to maximize the length of the match for /XY/ than it is to maximize the length of the match for /X/. This contradicts the left-to-right nature of the subpattern rule. Had the POSIX authors intended (1a), they would have written the grammar to be right-associative, to avoid this counterintuitive behavior of (1a). pro (1a): Had the POSIX authors intended that it be more important to maximize the length of /X/ than to maximize the length of /XY/ when matching /XYZ/, they could easily done so by using a right-associative grammar for ERE_branch. However, they used a left-associative grammar, indicating that (1a) was intended over (1b). This is reminiscent of the situation in the C programming language. Classical addition is associative, but Standard C requires that X+Y+Z must be evaluated according to C's left-associative grammar, and must therefore be evaluated as (X+Y)+Z and not as X+(Y+Z). The distinction matters, for example, if floating-point arithmetic is used, due to rounding errors. Similarly, classical RE concatenation is associative, but POSIX RE concatenation is not, due to the side effects of subpattern matching, so it is important to follow the associativity rules specified by the grammar. con (1a): The C Standard explicitly says that X+Y+Z must be evaluated as (X+Y)+Z, whereas the POSIX Standard does not explicitly say how to follow the RE grammar when determining matched subpatterns. Admittedly the X+Y+Z statement in the C Standard is a non-normative footnote, but the POSIX authors could also have added a footnote if they had wanted to make it clear that (1a) was intended. pro (1b) and (2): The semantics should not be contorted by the particular method used to define the syntax. The standard says: Portions of this volume of IEEE Std 1003.1-2001 are expressed in terms of a special grammar notation. It is used to portray the complex syntax of certain program input. The grammar is based on the syntax used by the yacc utility. However, it does not represent fully functional yacc input, suitable for program use; the lexical processing and all semantic requirements are described only in textual form. The grammar is not based on source used in any traditional implementation and has not been tested with the semantic code that would normally be required to accompany it. <http://www.opengroup.org/onlinepubs/007904975/utilities/xcu_chap01.html#tag_01_10> and this indicates that the grammar should not be taken overly seriously when interpreting the semantics. pro (1a): Some POSIX matchers use (1a). These include the Rx matcher and the Spencer-derived matchers used by Tcl and others. See also the arguments pro and con RE-CONCAT's interpretations (1), (2), and (3). The bottom line of this discussion: This question is probably moot unless RE-CONCAT interpretation (1) is chosen, in which case the realistic choices are (1a) or (1b). (1a) follows the POSIX standard more strictly, but (1b) typically matches user intent better. The differences between (1a) and (1b) rarely arise in practice, so it wouldn't affect applications much if either interpretation were chosen. A user can override either (1a) or (1b) by using explicit parenthesization. Suggestion: Modify the POSIX grammar so that RE concatenation is right-associative. This would cause (1a) to have the semantics of (1b), thus unifying the two interpretations and removing the confusion. --- Mark S. Brown Senior Technical Staff Member, IBM Server Group 512.838.3926 fax 512.838.3882 yyyyy@xxxxxxxxxx |
| <Prev in Thread] | Current Thread | [Next in Thread> |
|---|---|---|
| ||
| Previous by Date: | RE_CONCAT: question about RE concatenation and subpattern matching, Mark Brown |
|---|---|
| Next by Date: | Re: RE-ASSOC: a question about the associativity of RE concatenation, Keld Jørn Simonsen |
| Previous by Thread: | RE_CONCAT: question about RE concatenation and subpattern matching, Mark Brown |
| Next by Thread: | Re: RE-ASSOC: a question about the associativity of RE concatenation, Keld Jørn Simonsen |
| Indexes: | [Date] [Thread] [All Lists] |