Email List: Xaustin-review-lX

Defect in XCU awk

To: yyyyyyyyyyyyyyy@xxxxxxxxxxxxx
Subject: Defect in XCU awk
From: yyyyyy@xxxxxxxxxxx
Date: Tue, 11 May 2004 21:13:19 +0100 (BST)
        Defect report from : Paul Eggert , UCLA

(Please direct follow-up comments to yyyyyyyyyyyyyy@xxxxxxxxxxxxx)

@ page 177 line 6074 section awk objection {20040511a}

Problem:

Edition of Specification (Year): 2004

Defect code :  1. Error

The C99 standard introduced the notion of an explicit string
representation for infinities and NaNs, and since the POSIX "awk"
specification refers to C99, POSIX "awk" is required to support them.
However, the POSIX specification was not updated with this C99 change
in mind, and as a result POSIX "awk" is required to support infinities
and NaNs in some contexts but not others.  This should be fixed,
either by requiring support for them everywhere, or disallowing it
everywhere.

Here's the problem.  XCU page 177 lines 6074-6079 say that a string
value is considered a numeric string only if:

   after all the following conversions have been applied, the
   resulting string would lexically be recognized as a NUMBER token as
   described by the lexical conventions in Grammar:

     * All leading and trailing <blank>s are discarded.

     * If the first non- <blank> is '+' or '-' , it is discarded.

     * Changing each occurrence of the decimal point character from
       the current locale to a period.

and the rationale (page 181 lines 7121-7124) says:

   The intent has been to specify historical practice in almost all
   cases. The one exception is that, in historical implementations,
   variables and constants maintain both string and numeric values
   after their original value is converted by any use.
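
To make the numeric-string rule quoted above concrete, here is a rough
C sketch of it.  It is only an illustration under stated assumptions:
the names is_numeric_string_lexical and skip_digits are hypothetical,
the NUMBER token is simplified to a decimal integer or floating
constant with an optional exponent, and the POSIX locale is assumed so
that the decimal-point conversion step does nothing.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper, not from the standard: advance past decimal digits
   and record through *seen whether at least one digit was consumed. */
const char *skip_digits(const char *p, bool *seen)
{
    while (isdigit((unsigned char)*p)) {
        p++;
        *seen = true;
    }
    return p;
}

/* Sketch of the quoted rule, assuming the POSIX locale ('.' is the
   decimal point) and a simplified NUMBER token:
   digits [ '.' digits ] [ ('e'|'E') [sign] digits ]. */
bool is_numeric_string_lexical(const char *s)
{
    assert(s != NULL);

    /* Discard leading <blank>s, then one optional sign. */
    while (isblank((unsigned char)*s))
        s++;
    if (*s == '+' || *s == '-')
        s++;

    /* Mantissa: at least one digit before or after the period. */
    bool mantissa_digits = false;
    s = skip_digits(s, &mantissa_digits);
    if (*s == '.')
        s = skip_digits(s + 1, &mantissa_digits);
    if (!mantissa_digits)
        return false;

    /* Optional exponent, which needs at least one digit. */
    if (*s == 'e' || *s == 'E') {
        bool exponent_digits = false;
        s++;
        if (*s == '+' || *s == '-')
            s++;
        s = skip_digits(s, &exponent_digits);
        if (!exponent_digits)
            return false;
    }

    /* Discard trailing <blank>s; nothing else may remain. */
    while (isblank((unsigned char)*s))
        s++;
    return *s == '\0';
}
```

Under this sketch " -1.5e3 " and "1e600" are accepted, while "-INF"
and "NaN" are rejected because they contain no digits.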

As I understand it, historical practice was to invoke strtod on the
string, and to check that the string was entirely parsed by strtod
except possibly for some trailing blanks.  This matches the
specification above.  However, now that C99 has modified strtod's
behavior with respect to infinities and NaNs, there is a discrepancy
between this historical practice and what is now specified.
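
The strtod-based check described above might look roughly like the
following C sketch.  The function name is_numeric_string_strtod and
its out-parameter are my own inventions for illustration; I am not
claiming that any particular awk implementation is written this way.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical sketch: treat s as a numeric string if strtod parses a
   nonempty prefix and only <blank>s follow it.  On success the parsed
   value is stored in *value.  Note that strtod itself skips leading
   white space, a slightly larger set than <blank>s. */
bool is_numeric_string_strtod(const char *s, double *value)
{
    assert(s != NULL && value != NULL);

    char *end;
    double v = strtod(s, &end);

    /* No conversion was performed: strtod leaves end equal to s. */
    if (end == s)
        return false;

    /* Allow trailing <blank>s, but nothing else. */
    while (isblank((unsigned char)*end))
        end++;
    if (*end != '\0')
        return false;

    *value = v;
    return true;
}
```

With a C99 strtod this accepts " 42 " and also "-INF", whereas a
pre-C99 strtod would stop at the "I" and the string would be rejected.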

For example, in the POSIX locale the following shell command:

   awk -v n=-INF -v p=+INF 'BEGIN {print (n < p)}' </dev/null

must print 0 according to the specification since "+" precedes "-" in
the POSIX collating sequence.  This is true for many awk
implementations (e.g., Solaris 9 /usr/xpg4/bin/awk), but with some
implementations (e.g., GNU Awk 3.1.3 on Solaris 9) the command prints
1 because "-INF" and "+INF" are considered to be numeric strings.

There is an inconsistency here between numeric strings, where POSIX
requires that infinities and NaNs not be recognized, and ordinary
conversion of strings to numbers, where POSIX requires that infinities
and NaNs must be recognized.  For example, the POSIX awk expression
("-INF" + 0 < "+INF" + 0) must return 1, because atof returns
infinities (or at least, signed huge values) for the two strings.
This is inconsistent with the fact that "-INF" and "+INF" are not
considered to be numeric strings.
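
The two comparisons can be mimicked in C as a rough sketch.  The
function names below are hypothetical, strcmp stands in for string
comparison in the POSIX locale (where collation is byte-wise), and a
C99-conforming atof is assumed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for awk's string comparison in the POSIX
   locale, where strings collate byte by byte. */
bool less_as_strings(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);
    return strcmp(a, b) < 0;
}

/* Hypothetical stand-in for awk's (a + 0 < b + 0), using atof for the
   string-to-number conversion as the awk specification describes. */
bool less_as_numbers(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);
    return atof(a) < atof(b);
}
```

Here less_as_strings("-INF", "+INF") is false because '+' sorts before
'-', while less_as_numbers("-INF", "+INF") is true when atof follows
C99, which is the inconsistency described above.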

I see three possible fixes:

  1. The standard is correct as-is.  Conforming "awk" implementations
     must not consider infinities and NaNs to be numeric strings.

  2. The intent was for "awk" to disallow infinities and NaNs; add
     more restrictions that disallow them in strings (e.g., as the
     result of atof).

  3. The intent was for "awk" to use strtod(), so remove the
     restriction disallowing infinities and NaNs.

(1) is unsatisfactory, as it's not internally consistent.  Also, it
will be a pain to implement; e.g. "1e600" is a conforming numeric
string that evaluates to infinity, so an implementation won't be able
to simply invoke strtod, but will have to parse the string itself if
strtod returns infinity.

(2) is internally consistent, but is less useful for awk programmers.
It also suffers from some of the same performance problems as (1).

(3) is internally consistent, is more useful to programmers, and does
not suffer from the performance problems.  It's proposed below.

I should mention that this message is a follow-up to XCU ERN 22
<http://www.opengroup.org/sophocles/show_mail.tpl?source=L&listname=austin-review-l&id=1783>,
which talks about a similar problem for hexadecimal floating-point
numbers.  The problems are related and perhaps should be considered
together.


Action:

Change XCU page 177 lines 6074-6081 from this:

   and after all the following conversions have been applied, the
   resulting string would lexically be recognized as a NUMBER token as
   described by the lexical conventions in Grammar:

     * All leading and trailing <blank>s are discarded.

     * If the first non- <blank> is '+' or '-' , it is discarded.

     * Changing each occurrence of the decimal point character from
       the current locale to a period.

   If a '-' character is ignored in the preceding description, the
   numeric value of the numeric string shall be the negation of the
   numeric value of the recognized NUMBER token. Otherwise, the

to this:

   and after the equivalent of the following calls to functions
   defined by the ISO C standard, string_value_end differs from
   string_value, and all characters in string_value_end are <blank>s.

     char *string_value_end;
     setlocale(LC_NUMERIC, "");
     numeric_value = strtod (string_value, &string_value_end);

   The

Append the following text to the awk rationale, after XCU page 185
line 7300:

   Historical implementations of awk did not support floating-point
   infinities and NaNs in data, e.g., "-INF" and "NaN".  Because C99
   required support for these constants in atof(), support for them is
   now required in awk.  This is a silent change to the behavior of
   awk programs; for example, in the POSIX locale the expression
   ("-INF" + 0 < 0) formerly returned 0 because "-INF" converted to
   zero, but now returns 1 because "-INF" converts to negative
   infinity or to a huge negative value.  Due to an oversight, the
   2001 through 2004 editions of this standard did not allow support
   for infinities and NaNs in numeric strings, but this has been
   corrected in this edition.
