| To: | Austin Group <yyyyyyyyyyyyyy@xxxxxxxxxxxxx> |
|---|---|
| Subject: | RE-ITERATE: a question about RE iteration and subpattern matching |
| From: | Mark Brown <yyyyy@xxxxxxxxxx> |
| Date: | Fri, 5 Apr 2002 16:24:03 -0600 |
| Cc: | yyyyy@xxxxxxxxxxxxxx, "Shoji Sugiyama" <yyyyy@xxxxxxxxxx>, yyyy@xxxxxxx, Paul Eggert <yyyyyy@xxxxxxxxxxx> |
|
[Some of the people in the GNU projects (Paul Eggert, Tom Lord, Isamu Hasegawa, myself) were considering consolidating the various regex implementations in GNU, and came up with a series of questions concerning the Specification. This is one of those questions.] RE-ITERATE: a question about RE iteration and subpattern matching When the ERE /(R)*/ matches an entire string S, is the subpattern /(R)/ required to match the longest possible suffix of S that is consistent with an overall match? This question follows up the earlier questions RE-CONCAT and RE-ASSOC; please see the discussion of the earlier question for much of the background behind this question. This question is about the requirements given by the following quotes from the regcomp description: (Subexpression i begins at the ith matched open parenthesis, counting from 1.)... 1. If subexpression i in a regular expression is not contained within another subexpression, and it participated in the match several times, then the byte offsets in pmatch[i] shall delimit the last such match. <http://www.opengroup.org/onlinepubs/007904975/functions/regcomp.html>: Suppose, for example, we match the pattern /(a.*z|b.*y)*.*/ to the string "azbazbyc", and want to know which substring was matched by the subpattern /(a.*z|b.*y)/. Here are some possible interpretations: (1) /(a.*z|b.*y)*/ matches the longest possible prefix "azbazby". The last match for /(a.*z|b.*y)/ is therefore the longest suffix of this prefix that is consistent with an overall match, namely "bazby". (2) First the subpattern /(a.*z|b.*y)/ is applied left to right across the string, looking for matches that are consistent with an overall match. This procedure matches "azbaz" and "by" in turn. The last match for /(a.*z|b.*y)*/ is therefore "by". (3) The standard is ambiguous, and both the above interpretations are valid. Perhaps other interpretations are valid as well. A conforming implementation might even use different interpretations at different times. Interpretation (2) needs some further explanation, since it relies on a depth-first search procedure that may not be immediately obvious. According to this procedure, to match the ERE /(A)*/ to the prefix of a string S under the constraint of an overall match to S, apply the following steps: (i) Apply A to the prefix of S once, subject to the constraint of overall match. (ii) Suppose that A's match consumes the initial substring S1, which must be the quasi-longest such substring. (iii) If S1 is empty, succeed and report that A matched the empty string. (iv) Otherwise, repeatedly apply A to the rest of S, each time finding the quasi-longest initial prefix matching A under the constraint of overall match. Repeat this process until the next submatch would either fail or would be empty. (v) Report the last match, which in this case must be nonempty. The term "quasi-longest" is used above because (2)'s matching procedure is greedy and in some cases doesn't find the longest match for a subexpression. Instead, it finds the longest match that a greedy matcher can find subject to the constraint of overall match. Interpretations (1)-(3) are related to interpretations (1)-(3) of question RE-CONCAT, and share many of their pros and cons. Here are some more specific arguments for and against these interpretations: pro (1): The standard explicitly says that the subpattern must match the "longest possible string" <http://www.opengroup.org/onlinepubs/007904975/basedefs/xbd_chap09.html#tag_09_01_02> . "bazby" is the longest possible string here, so it must be the match. pro (2): The phrase "left to right" implies that the matcher must apply the subpattern several times, left to right, greedily but consistently with the overall match being the longest. con (2): Even more so than with RE-CONCAT, interpretation (2) is quite complicated. It is unlikely that the POSIX authors intended (2), since (1) is a much more straightforward interpretation. --- Mark S. Brown Senior Technical Staff Member, IBM Server Group 512.838.3926 fax 512.838.3882 yyyyy@xxxxxxxxxx |
| <Prev in Thread] | Current Thread | [Next in Thread> |
|---|---|---|
| ||
| Previous by Date: | Re: RE-ASSOC: a question about the associativity of RE concatenation, Keld Jørn Simonsen |
|---|---|
| Next by Date: | Re: RE-ASSOC: a question about the associativity of RE concatenation, Tom Lord |
| Previous by Thread: | RE-ASSOC: a question about the associativity of RE concatenation, Mark Brown |
| Next by Thread: | Re: Defect in XSH unlink, Andrew Josey |
| Indexes: | [Date] [Thread] [All Lists] |