Email List: austin-group-l

Re: multibyte C locale

To: austin-group-l@xxxxxxxxxxxxx
Subject: Re: multibyte C locale
From: shwaresyst@xxxxxxx
Date: Sat, 31 Oct 2009 12:53:06 -0400
References: <4AE83F49.1090805@xxxxxx><8CC273626A167C9-708C-A347@webmail-m008.sysops.aol.com><4AEAA953.7070605@xxxxxx><20091030093610.GA31100@xxxxxx><20091030131925.GH28296@xxxxxx><20091030140706.GA12871@xxxxxx> <20091030181139.GI28296@xxxxxx> <8CC27A9B7A337AB-1BD0-9799@webmail-d053.sysops.aol.com> <OF061B1E55.0E7681F0-ON8525765F.006B7217-8525765F.006BDF48@xxxxxx> <8CC27B486F71509-1BD0-AA5A@webmail-d053.sysops.aol.com> <8CC27BC549BC800-55E0-15499@webmail-d051.sysops.aol.com> <OFB90F3857.A3B4F344-ON8525765F.0074CF82-8525765F.0074FC09@xxxxxx>
It would be nice, yes... I have no objection to a standard locale being defined as a superset of the POSIX locale that also includes a requirement to process UTF-8. However, I can't see making an exception for UTF-8 in the locale designed to be the smallest one compatible with the requirements of the C standard until two things happen: the POSIX standard interfaces are allowed to call the mb* functions in the handling of the functions in string.h and wchar.h, and the C standard *requires* the comparison and assignment operators to function correctly, lexically, for ALL varieties of shifted encodings. That is, either 'char' would always have to behave as a pointer to bytes rather than as a single byte that might be pointed to, or the standard would have to define an explicit mbchar type. If you look at the descriptions of the mb* functions, they all include a statement like the one at, e.g., XSH line 41759: "The implementation shall behave as if no function defined in this volume of POSIX.1-200x calls mb*( )."

Sorry if I've expressed my thinking fuzzily, but I wouldn't vote for any attempt to relax the restriction. It is what it is; I do believe that before POSIX could approve it, changes of this nature would have to be incorporated into the C standard, so the C working group should be lobbied first.

Mark


-----Original Message-----
From: Glen Seeds <Glen.Seeds@xxxxxx>
To: shwaresyst@xxxxxx
Cc: austin-group-l@xxxxxx
Sent: Fri, Oct 30, 2009 5:17 pm
Subject: Re: multibyte C locale

The whole point of this discussion is that UTF-8 was in fact carefully
crafted that way, and we want conforming programs to be able to use it
for the POSIX locale.

/glen


From: shwaresyst@xxxxxx
To: austin-group-l@xxxxxx
Date: 2009-10-30 05:10 PM
Subject: Re: multibyte C locale

Clarifying addendum: a code set matching those criteria could be defined
so that a non-PCS character refers to different code points depending on
whether it is prefaced by a state-changing character, i.e. it is a
non-state-changing code if encountered first, or a state-continuation
code if encountered after a state-changing code. A test for whether it
is a PCS char is still simple, but a routine trying to do GetNextChar
might still fail if passed a pointer into the middle of a multi-byte
sequence. UTF-8 may have been designed to avoid this, but the proposed
wording, as written, still allows for sets that aren't as carefully
crafted.
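
To make the GetNextChar concern concrete, here is a minimal sketch of why UTF-8 in particular lets a routine recover from a pointer into the middle of a sequence: continuation bytes always have the bit pattern 10xxxxxx, so stepping backwards over them finds the first byte of the character. The helper names are hypothetical, and the sketch assumes the buffer holds valid UTF-8. A code set without this property offers no such test.

```c
#include <assert.h>
#include <stddef.h>

/* True for UTF-8 continuation bytes, which have the form 10xxxxxx. */
static int is_utf8_continuation(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}

/* Hypothetical helper: given a pointer p anywhere inside the buffer that
 * begins at start, step back to the first byte of the character that
 * contains p. Assumes valid UTF-8; never moves before start. */
static const char *utf8_char_start(const char *start, const char *p)
{
    assert(start != NULL && p != NULL && p >= start);
    while (p > start && is_utf8_continuation((unsigned char)*p))
        p--;
    return p;
}
```

For the string "a\xC3\xA9z", a pointer at the second byte of the two-byte character (offset 2) is moved back to offset 1, while pointers at 'a' or 'z' are returned unchanged.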



Mark

-----Original Message-----
From: shwaresyst@xxxxxx
To: Glen.Seeds@xxxxxx
Cc: austin-group-l@xxxxxx
Sent: Fri, Oct 30, 2009 4:13 pm
Subject: Re: multibyte C locale

I don't believe so. This would still force all applications doing
lexical analysis to use routines with extra logic to test whether a
given byte is a state-changing code that must be accounted for (even if
only to throw that code and the next byte or bytes away from lexical
consideration before continuing), rather than simply a
non-state-changing code outside the PCS that can be disregarded. For
applications like the C compiler, when rebuilding a million or more
lines of code, this could noticeably add to the processing time required
to complete the task, I'd think.
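
For reference, a rough sketch of the simpler case mentioned here, where a non-PCS byte can just be disregarded. It assumes an encoding such as UTF-8 in which every byte of a non-portable character has its high bit set; the function names are hypothetical, and the sketch says nothing about the cost of the stateful case, which would need additional per-byte bookkeeping.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption for this sketch: every byte of a non-portable character has
 * its high bit set (true of UTF-8), so it can never be mistaken for a
 * PCS character. */
static int is_non_portable_byte(unsigned char b)
{
    return b >= 0x80;
}

/* Count occurrences of the portable character target in s, disregarding
 * any bytes that belong to non-portable characters. */
static size_t count_portable(const char *s, char target)
{
    size_t n = 0;
    assert(s != NULL);
    for (; *s != '\0'; s++) {
        unsigned char b = (unsigned char)*s;
        if (is_non_portable_byte(b))
            continue;   /* no shift state to track, just skip the byte */
        if (b == (unsigned char)target)
            n++;
    }
    return n;
}
```

Called on "x=\xC3\xA9;y=\xE2\x82\xAC;" with ';' as the target, the function counts the two semicolons and ignores the five high-bit bytes.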

Cheers,
Mark

-----Original Message-----
From: Glen Seeds <Glen.Seeds@xxxxxx>
To: shwaresyst@xxxxxx
Cc: austin-group-l@xxxxxx
Sent: Fri, Oct 30, 2009 3:38 pm
Subject: Re: multibyte C locale

I believe that would make a lot of working applications non-conformant.
Could we say:

In the POSIX locale, a character from the portable character set must
not have a state-dependent encoding. For characters that have
state-dependent encoding, the encoding of each part must be distinct
from the coding of all portable characters.
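
One way to read the second sentence of that wording is as a per-byte check, sketched below. This is only an illustration: it assumes the portable character set is encoded as 7-bit ASCII (which POSIX does not require), and the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption for this sketch: the portable character set is encoded as
 * 7-bit ASCII, so a byte collides with a PCS character when it is < 0x80. */
static int is_pcs_byte(unsigned char b)
{
    return b < 0x80;
}

/* Hypothetical check: returns 1 if no byte of the multibyte sequence
 * seq[0..len-1] coincides with the encoding of a PCS character, else 0. */
static int parts_distinct_from_pcs(const unsigned char *seq, size_t len)
{
    size_t i;
    assert(seq != NULL);
    for (i = 0; i < len; i++) {
        if (is_pcs_byte(seq[i]))
            return 0;
    }
    return 1;
}
```

Under these assumptions the UTF-8 sequence E2 82 AC (U+20AC) passes. The Shift_JIS sequence 83 5C fails because its second byte equals the ASCII backslash; Shift_JIS is not state-dependent, so it is used here only to show what a byte collision looks like.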

/glen
