| To: | austin-group-l@xxxxxxxxxxxxx |
|---|---|
| Subject: | Re: multibyte C locale |
| From: | shwaresyst@xxxxxxx |
| Date: | Sat, 31 Oct 2009 12:53:06 -0400 |
| References: | <4AE83F49.1090805@xxxxxx><8CC273626A167C9-708C-A347@webmail-m008.sysops.aol.com><4AEAA953.7070605@xxxxxx><20091030093610.GA31100@xxxxxx><20091030131925.GH28296@xxxxxx><20091030140706.GA12871@xxxxxx> <20091030181139.GI28296@xxxxxx> <8CC27A9B7A337AB-1BD0-9799@webmail-d053.sysops.aol.com> <OF061B1E55.0E7681F0-ON8525765F.006B7217-8525765F.006BDF48@xxxxxx> <8CC27B486F71509-1BD0-AA5A@webmail-d053.sysops.aol.com> <8CC27BC549BC800-55E0-15499@webmail-d051.sysops.aol.com> <OFB90F3857.A3B4F344-ON8525765F.0074CF82-8525765F.0074FC09@xxxxxx> |
|
It would be nice, yes... I've no objection to a standard locale being
defined that also includes a requirement to process UTF-8 as a superset
of the POSIX locale. However, until the POSIX standard interfaces are
allowed to call the mb* functions in the handling of the functions in
string.h and wchar.h, and the C standard *requires* the comparison and
assignment operators to function correctly lexically for ALL varieties
of shifted encodings (i.e. either 'char' is defined to always behave as
a pointer to bytes rather than as a single byte that might be pointed
to, or the standard defines an explicit mbchar type), I can't see making
an exception for UTF-8 in the locale designed to be the smallest one
compatible with the requirements of the C standard. If you look at all
the descriptions for
the mb* functions, they include a statement like, e.g. at XSH line
41759, "The implementation shall behave as if no function defined in
this volume of POSIX.1-200x calls mb*( )."

Sorry if my thinking has been fuzzy in how I've expressed it, but I
wouldn't vote for any attempt to relax the restriction. It is what it
is; I do believe that before it could be approved by POSIX, changes of
this nature would have to be incorporated in the C standard, and the C
working group should be lobbied first.

Mark

-----Original Message-----
From: Glen Seeds <Glen.Seeds@xxxxxx>
To: shwaresyst@xxxxxx
Cc: austin-group-l@xxxxxx
Sent: Fri, Oct 30, 2009 5:17 pm
Subject: Re: multibyte C locale

The whole point of this discussion is that UTF-8 was in fact carefully
crafted that way, and we want conforming programs to be able to use it
for the POSIX locale.

/glen

From: shwaresyst@xxxxxx
To: austin-group-l@xxxxxx
Date: 2009-10-30 05:10 PM
Subject: Re: multibyte C locale

Clarifying addendum: a code set matching that criteria could be defined
so that a non-PCS character could refer to different code points
depending on whether it was prefaced by a state-changing character or
not, i.e. be a non-state-changing code if encountered first, or a
state-continuation code if encountered after a state-changing code. A
test for whether it is a PCS char or not is still simple, but a routine
trying to do GetNextChar might still fail if passed a pointer into the
middle of a multi-byte sequence. UTF-8 may have been designed to avoid
this, but as written it still allows for sets that aren't as carefully
crafted.

Mark

-----Original Message-----
From: shwaresyst@xxxxxx
To: Glen.Seeds@xxxxxx
Cc: austin-group-l@xxxxxx
Sent: Fri, Oct 30, 2009 4:13 pm
Subject: Re: multibyte C locale

I don't believe so. This would still force all applications doing
lexical analysis to use routines that need to include extra logic to
test whether a given byte is or isn't a state-changing code that it
might need to account for, even if just to throw that code and the next
byte or bytes away from lexical consideration before continuing
processing, and not simply a non-state-changing code that isn't part of
the PCS which can be disregarded. For applications like the C compiler,
when doing a rebuild of a million or more lines of code, this could
noticeably add to the processing time required to complete the task,
I'd think.

Cheers,
Mark

-----Original Message-----
From: Glen Seeds <Glen.Seeds@xxxxxx>
To: shwaresyst@xxxxxx
Cc: austin-group-l@xxxxxx
Sent: Fri, Oct 30, 2009 3:38 pm
Subject: Re: multibyte C locale

I believe that would make a lot of working applications non-conformant.
Could we say:

In the POSIX locale, a character from the portable character set must
not have a state-dependent encoding. For characters that have
state-dependent encoding, the encoding of each part must be distinct
from the coding of all portable characters.

/glen
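
As an illustration of the property being argued over in this thread, here is a minimal C sketch, assuming the text is well-formed, NUL-terminated UTF-8. It is not taken from any of the messages. The function names (`utf8_is_continuation`, `utf8_is_single_byte`, `utf8_resync`, `count_ascii`) are hypothetical, and `utf8_resync` only approximates the GetNextChar routine mentioned above. It relies on two facts about UTF-8: every byte of a multibyte sequence has its high bit set, and continuation bytes all have the bit pattern 10xxxxxx. A general state-dependent encoding gives neither guarantee, which is the case the objections above are concerned with.

```c
#include <assert.h>
#include <stddef.h>

/* Nonzero if b is a UTF-8 continuation byte (bit pattern 10xxxxxx). */
int utf8_is_continuation(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}

/* Nonzero if b is a complete single-byte character (below 0x80).
   No knowledge of the preceding bytes is needed, because no byte of
   a UTF-8 multibyte sequence falls in this range. */
int utf8_is_single_byte(unsigned char b)
{
    return b < 0x80;
}

/* Hypothetical GetNextChar-style helper: given a pointer anywhere in a
   NUL-terminated UTF-8 string, possibly in the middle of a multibyte
   sequence, return the start of the first character at or after p.
   The terminating NUL is not a continuation byte, so the loop stops. */
const char *utf8_resync(const char *p)
{
    assert(p != NULL);
    while (utf8_is_continuation((unsigned char)*p))
        p++;
    return p;
}

/* Count occurrences of the single-byte character c with plain byte
   comparison; assumes c is below 0x80, so it cannot match part of a
   multibyte sequence. */
size_t count_ascii(const char *s, char c)
{
    size_t n = 0;
    assert(s != NULL);
    for (; *s != '\0'; s++)
        if (*s == c)
            n++;
    return n;
}
```

For example, with the string `"a\xC3\xA9/b"` (the letter a, a two-byte character, a slash, the letter b), a pointer to the third byte lands inside the two-byte character, and `utf8_resync` advances it to the slash. With a stateful encoding, the same byte value could mean different things depending on earlier shift codes, so no comparable local test would exist.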