| To: | Austin Group <yyyyyyyyyyyyyy@xxxxxxxxxxxxx> |
|---|---|
| Subject: | RE_CONCAT: question about RE concatenation and subpattern matching |
| From: | Mark Brown <yyyyy@xxxxxxxxxx> |
| Date: | Fri, 5 Apr 2002 13:23:51 -0600 |
| Cc: | Isamu Hasegawa <yyyyy@xxxxxxxxxxxxxx>, yyyy@xxxxxxx, Paul Eggert <yyyyyy@xxxxxxxxxxx> |
|
RE-CONCAT: a question about RE concatenation and subpattern matching [Some of the people in the GNU projects (Paul Eggert, Tom Lord, Isamu Hasegawa, myself) were considering consolidating the various regex implementations in GNU, and came up with a series of questions concerning the Specification. This is one of those questions.] When the regular expression /XY/ matches an entire string S, must the subpattern /X/ match the longest prefix of S that is consistent with the overall match? This is a question of interpretation about the following sentence in the POSIX standard: Consistent with the whole match being the longest of the leftmost matches, each subpattern, from left to right, shall match the longest possible string. <http://www.opengroup.org/onlinepubs/007904975/basedefs/xbd_chap09.html#tag_09_01_02> For reference, the POSIX rationale goes on to say: It is possible to determine what strings correspond to subexpressions by recursively applying the leftmost longest rule to each subexpression, but only with the proviso that the overall match is leftmost longest. <http://www.opengroup.org/onlinepubs/007904975/xrat/xbd_chap09.html#tag_01_09_01> Here is an example to illustrate the question. Suppose we match the ERE /((week|wee)(night|knights))(s*)/ to the entire string "weeknights", and want to know which subpatterns match which substrings. Here are some possible interpretations: (1) The "subpatterns" are the two immediate subexpressions. The longest consistent match for /((week|wee)(night|knights))/ is "weeknights". Therefore, /(s*)/ matches the empty string. (2) The "subpatterns" are the (not necessarily immediate) subexpressions, regardless of whether they contribute directly to the concatenation. One subexpression is considered to be left of another subexpression if it occurs before the other in the postorder traversal of the ERE's parse tree, when the ERE is parsed according to the POSIX grammar for EREs. In this left-to-right order, the subexpressions are: /week/ /wee/ /week|wee/ /(week|wee)/ /night/ /knights/ /night|knights/ /(night|knights)/ /(week|wee)(night|knights)/ /((week|wee)(night|knights))/ /s/ /s*/ /(s*)/ /((week|wee)(night|knights))(s*)/ (There are actually more subexpressions -- for example, /night/ has eight subexpressions /n/, /i/, /ni/, /g/, /nig/, /h/, /nigh/, /t/ -- -- but these extra subexpressions do not contribute to the analysis and are omitted for brevity.) We apply the longest-match rule to each of these subexpressions in order. The longest consistent match for /week/ is "week", so /week|wee/ and /(week|wee)/ match "week". The longest consistent match for /night/ against the trailing suffix "nights" is "night", so /night|knights/ and /(night|knights)/ both match "night". Hence /(week|wee)(night|knights)/ and /((week|wee)(night|knights))/ both match "weeknight". Finally, /s/, /s*/, and /(s*)/ match "s". (3) The standard is ambiguous, and both the above interpretations are valid. Perhaps other interpretations are valid as well. A conforming implementation might even use different interpretations at different times. Here are some arguments for and against these interpretations. These lengthy arguments are briefly summarized at the end of this section. pro (1): (1) is simple and straightforward. con (2): (2) is counterintuitive, because it causes subexpressions to not find a longest consistent match, despite the longest-match rule. For example, under (2), when matching /(.|..).*/ against "ab", most people would expect /(.|..)/ to match "ab" because it is a longer match, but (2) requires it to match the shorter substring "a". Conversely, most people would expect /(.|..)/ and /(.{1,2})/ to be equivalent, but (2) says that when matching /(.{1,2}).*/ against "ab", /(.{1,2})/ matches "ab" instead of the "a" that /(.|..)/ would match in its place. Here is another example. (2) causes the interpretation of a regular expression X to be changed merely because X is followed by /.*/. This contradicts the left-to-right nature of the subpattern rule. For example, under (2), when matching /(.|..)/ against "ab", /(.|..)/ matches the whole string; but when matching /(.|..).*/ against "ab", /(.|..)/ matches only "a" even though it would be consistent for it to match the same "ab" as before. In general, submatching must be context-dependent, but it is counterintuitive for /.*/ to provide context from the right in a matcher that is supposed to be left-to-right. This counterintuitive behavior does not occur with (1), nor does it occur with a purely greedy NFA-style matcher that does not conform to POSIX because it does not always find the longest match; but it does occur with (2). A more general statement of this point is that '|' is not commutative. For example, /(week|wee)(night|knights)(s*)/ differs from /(wee|week)(night|knights)(s*)/ when matching against "weeknights". The former pattern matches /(s*)/ to "s", whereas the latter one matches /(s*)/ to the empty string. This surprises some users and is an annoyance for programs that construct regular expressions. pro (2): Many people want the left subexpression of '|' to have priority. con (2): POSIX explicitly rejected the expectation that the left subexpression of '|' should have priority, by ruling that it's more important to have the longest match than to have the left subexpression match. The left subexpression does not have priority in /(.|..)/, for example. pro (2): If you want the longest match, then in almost all cases you can reorder the alternatives of '|' so that the leftmost subexpression matches the longest substring. With (2) you have the option of preferring a shorter match for a '|' subexpression, if that is what you want. (1) does not give you that option. con (2): But under (2) you can't give an alternative priority at the top level, unless the alternatives happen to be the same length. It is odd for (2) to have one rule at the top level, but a different rule for subexpressions. It would be more consistent if the same rule applied at all levels. It is particularly odd that the standard doesn't mention this discrepancy; it indicates that the discrepancy was not intended. con (2): It's implausible that (2) was intended, given that the standard does not mention issues like postorder traversal which are inherent to (2). Had the POSIX authors intended the counterintuitive behaviors inherent to (2), they would have mentioned them. pro (2): In effect, (1) specifies a preorder traversal, and (2) specifies a postorder traversal. The standard does not specify either preorder or postorder, so neither is more plausible than the other and postorder should be allowed. con (2): The POSIX authors were clearly thinking about DFA-based implementations. This is why <http://www.opengroup.org/onlinepubs/007904975/basedefs/xbd_chap09.html#tag_09_01_02> gives the example where /(wee|week)(knights|night)/ matches all of "weeknights". Here, the authors are making the point that a DFA-based matcher must be run while a further match is possible, and that one cannot stop the DFA on the first match. Had the POSIX authors wanted to allow (2), they would have been inclined to use an example that distinguishes between a greedy NFA-style matcher and (2), because a greedy NFA-style matcher is incorrect but is superficially similar to (2). For example, they could have used the example of matching /(week|wee)(night|knights)/ against "weeknights", where a greedy NFA-style matcher will match only "weeknight" instead of the "weeknights" required by (2). The fact that the POSIX authors did not give such an example indicates that they were not considering (2) as a possible interpretation. pro (1): (1) takes less space to implement than (2) does. If P is the pattern length and S is the string length, (1) requires only O(P) space, but (2) requires O(S) space and the constant multiple of S is relatively large in practice. (2) needs this space to record the states of the DFA match, so that it can then run an NFA forward across the same string and use the results of the DFA match as an oracle at each nondeterministic branch. (2) can be done in less than O(S) space, but only with an exponential time overhead that would be unacceptable in practice. pro (2): There is little difference in performance in practice if you optimize (2) carefully enough, because the worst cases occur only in pathological examples. con (2): Optimizing (2) is tricky, and this trickiness is a maintenance burden that it would be better to avoid. pro (2): Many existing POSIX matchers seem to use (2), including Solaris 8 and the GNU C library 2.2. con (2): However, these implementations predate POSIX.2-1992, and though they were modified to conform to POSIX, quite possibly this particular issue hasn't been tested in the available test suites, or perhaps the test suites themselves are not correct. Since POSIX.2-1992 explicitly rejected the common existing practice of greedy NFA-style matching, one shouldn't rely too heavily on the perhaps-unmodified continuation of that existing practice. con (2): The BSD matcher uses (2), but it is not a POSIX matcher because it does not consistently find the longest overall match. Under BSD, matching the BRE /\(\(abc\)\{0,2\}\(abc\)\{0,2\}\)\3/ against the string "abcabc" incorrectly yields only "abc" in both FreeBSD 3.0 and OpenBSD 3.0. This suggests that (2) is correlated with the pre-POSIX practice that POSIX.2-1992 rejected. pro (1): Some POSIX matchers use (1). These include the Rx matcher and the Spencer-derived matchers used by Tcl and others. pro (3): Insisting on either (1) or (2) would invalidate some existing implementations. con (3): Insisting on either (1) or (2) would invalidate few if any existing applications, and applications are more important than implementations. The POSIX authors explicitly chose DFA-style semantics for REs because they are "easier to define and describe" than other semantics. They also wrote: It is thought that dependencies on the choice of rule are rare; carefully contrived examples are needed to demonstrate the difference. (Both these quotes are taken from <http://www.opengroup.org/onlinepubs/007904975/xrat/xbd_chap09.html#tag_01_09_02> .) This suggests that the POSIX authors desired a simple, deterministic semantics for regular expressions, and thought that most users wouldn't care so much which semantics were chosen, so long as the behavior was standardized. It is therefore unlikely that the POSIX authors intended the complicated, ambiguous semantics that would result if multiple interpretations were valid. The bottom line of this discussion: (1) is simpler and more intuitive, and it better fits the intent of POSIX's overall longest-match rule. (2) is more popular in practice. The differences between (1) and (2) rarely arise in practice, so it wouldn't affect applications much if either interpretation were chosen. --- Mark S. Brown Senior Technical Staff Member, IBM Server Group 512.838.3926 fax 512.838.3882 yyyyy@xxxxxxxxxx |
| <Prev in Thread] | Current Thread | [Next in Thread> |
|---|---|---|
| ||
| Previous by Date: | Re: LC_MONETARY and int_currency_symbol, Andrew Josey |
|---|---|
| Next by Date: | RE-ASSOC: a question about the associativity of RE concatenation, Mark Brown |
| Previous by Thread: | Agenda for the May meeting, Andrew Josey |
| Next by Thread: | Re: RE_CONCAT: question about RE concatenation and subpattern matching, Glenn Fowler |
| Indexes: | [Date] [Thread] [All Lists] |