Email List: Xaustin-review-lX

Defect in XBD 6.4

To: yyyyyyyyyyyyyyy@xxxxxxxxxxxxx
Subject: Defect in XBD 6.4
From: yyyyyyyyyy@xxxxxxx
Date: Tue, 25 Feb 2003 08:26:50 GMT
Defect report from: Mark Ziegast, SHware Systems

(Please direct follow-up comments to yyyyyyyyyyyyyy@xxxxxxxxxxxxx)

@ page 119 line 3824 section 6.4 objection {030224-01}

Problem:

Defect code :  2. Omission

Synopsis: 

Incomplete specification of encoding requirements for single-shift state 
charsets.

Background:
(from C950.pdf)

3842 In lines defining ranges of symbolic names, the encoded value shall be the 
value for the first symbolic name in the range (the symbolic name preceding the 
ellipsis). Subsequent symbolic names defined by the range shall have encoding 
values in increasing order. Bytes shall be treated as unsigned octets, and 
carry shall be propagated between the bytes as necessary to represent the 
range. For example, the line:
3847 <j0101>...<j0104> \d129\d254
3848 is interpreted as:
3849 <j0101> \d129\d254
3850 <j0102> \d129\d255
3851 <j0103> \d130\d0
3852 <j0104> \d130\d1
3853 The comment is optional.

Also,

3885 6.4.1 State-Dependent Character Encodings
3886 This section addresses the use of state-dependent character encodings 
(that is, those in which the encoding of a character is dependent on one or 
more shift codes that may precede it). A single-shift encoding (where each 
character not in the initial shift state is preceded by a shift code) can be 
defined in the charmap format if each shift-code/character sequence is 
considered a multi-byte character, defined using the concatenated-constant 
format described in Section 6.4 (on page 117).

as illustrated above for a wide char range, and

3719 While in the initial shift state, all characters in the portable character 
set shall retain their usual interpretation and shall not alter the shift 
state. The interpretation for subsequent bytes in the sequence shall be a 
function of the current shift state. A byte with all bits zero shall be 
interpreted as the null character independent of shift state. Thus a byte with 
all bits zero shall never occur in the second or subsequent bytes of a 
character.

Problem:

As the definition of the wide character range <j0101>...<j0104> above shows, 
a multi-byte character definition is allowed to have a <nul> as a second or 
subsequent byte. (The example is also missing the leading zeroes in the last 
two lines of the range expansion, but that's editorial.) A single-shift 
encoding is expected to use the same formats for individual chars and ranges 
of chars, yet by Par. 3719 it can NOT have a <nul> as a second byte; i.e. for 
a range that carries over a byte boundary the next legitimate value is \d01, 
not \d00. There is currently no way to tell whether a given char or range 
declaration is meant to define wide chars or single-shift chars. Nor is there 
a requirement that an implementation declaring single-shift chars refrain 
from using the range notation when it has multiple shift chars with 
consecutive byte values in a charset it wishes to define. Thus a parser could 
blithely generate <nul>s, even though the chardef file otherwise looks 
perfectly legitimate to it.
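
To make the failure concrete, here is a minimal sketch of a range expansion that follows lines 3842-3852 literally: bytes are treated as unsigned octets and carry is propagated. The function name expand_range_naive and the list-of-ints representation are assumptions made for illustration only; they are not taken from any real localedef implementation.

```python
def expand_range_naive(first, count):
    """Expand a charmap range by plain unsigned increment with carry.

    first: byte values of the first encoding, e.g. [129, 254]
    count: number of symbolic names covered by the range
    """
    width = len(first)
    start = int.from_bytes(bytes(first), "big")
    # to_bytes raises OverflowError if the range no longer fits in `width` bytes
    return [list((start + i).to_bytes(width, "big")) for i in range(count)]


codes = expand_range_naive([129, 254], 4)
# Flag every expanded code whose second or later byte is zero.
has_nul = [0 in code[1:] for code in codes]
```

For the range <j0101>...<j0104> \d129\d254 the third code has a zero second byte, which is fine for a wide character but violates line 3719 if the range was meant as single-shift characters.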

Action:

This is a bit wordy, in order to show context adequately.

Change lines 3824-3835 to read:

3824 The encoding part indicates the numeric code of an individual character 
value or the first character of a range of values. For a single-byte character 
value the code is specified as a decimal, octal, or hexadecimal constant in one 
of the following formats:

3826 "%cd%u", <escape_char>, <decimal byte value>
3827 "%cx%x", <escape_char>, <hexadecimal byte value>
3828 "%c%o", <escape_char>, <octal byte value>

3829 Decimal constants shall be represented by two or three decimal digits, 
preceded by the escape character and the lowercase letter ’d’; for example, 
"\d05", "\d97", or "\d143". Hexadecimal constants shall be represented by two 
hexadecimal digits, preceded by the escape character and the lowercase letter 
’x’; for example, "\x05", "\x61", or "\x8f". Octal constants shall be 
represented by two or three octal digits, preceded by the escape character; for 
example, "\05", "\141", or "\217". In a portable charmap file, each constant 
represents an 8-bit byte.

A multi-byte character value is encoded by concatenating the constants 
describing the individual byte values. When encoding a single-shift state 
character or range, the lowercase letter 's' shall be prepended to the combined 
constant to distinguish it from the encoding for a wide character. When 
constants... etc.
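
As a rough illustration of how a parser might read the proposed notation, the sketch below accepts the three constant formats listed above and an optional leading 's'. The name parse_encoding, the regular expression, and the choice to reject a zero in a second or later byte of an 's' encoding are assumptions made for this example; the 's' marker itself is only the notation proposed here, not current standard text.

```python
import re


def parse_encoding(field, escape_char="\\"):
    """Parse a charmap encoding field into (is_single_shift, byte_values)."""
    # Proposed marker: a leading lowercase 's' flags a single-shift encoding.
    single_shift = field.startswith("s")
    body = field[1:] if single_shift else field
    esc = re.escape(escape_char)
    # Decimal: 2-3 digits after 'd'; hex: 2 digits after 'x'; octal: 2-3 digits.
    constant = re.compile(
        esc + r"(?:d([0-9]{2,3})|x([0-9A-Fa-f]{2})|([0-7]{2,3}))"
    )
    values = []
    pos = 0
    while pos < len(body):
        match = constant.match(body, pos)
        if match is None:
            raise ValueError(f"bad constant at offset {pos} in {field!r}")
        dec, hexa, octal = match.groups()
        if dec is not None:
            value = int(dec, 10)
        elif hexa is not None:
            value = int(hexa, 16)
        else:
            value = int(octal, 8)
        if value > 255:
            raise ValueError(f"constant out of byte range in {field!r}")
        values.append(value)
        pos = match.end()
    if not values:
        raise ValueError(f"no constants found in {field!r}")
    # Assumption: enforce the line 3719 rule only for 's' encodings.
    if single_shift and 0 in values[1:]:
        raise ValueError(f"zero byte in single-shift encoding {field!r}")
    return single_shift, values
```

Because the marker comes first, the parser knows which rule applies before it reads any constant, for example parse_encoding(r"s\d129\d254") yields (True, [129, 254]).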

and change lines 3842-3852 to read:

3842 In lines defining ranges of symbolic names, the encoded value shall be the 
value for the first symbolic name in the range (the symbolic name preceding the 
ellipsis). Subsequent symbolic names defined by the range shall have encoding 
values in increasing order. Bytes shall be treated as unsigned octets, and 
carry shall be propagated between the bytes as necessary to represent the 
range. For a single-shift state character, the value "\d00" shall be skipped 
in every byte constant of the encoding. As examples:

the wide character range encoding <j0101>...<j0104> \d129\d254 is interpreted 
as:
 <j0101> \d129\d254
 <j0102> \d129\d255
 <j0103> \d130\d00
 <j0104> \d130\d01

and the single-shift state encoding <j0101>...<j0104> s\d129\d254
is interpreted as:
 <j0101> s\d129\d254
 <j0102> s\d129\d255
 <j0103> s\d130\d01
 <j0104> s\d130\d02

3853 The comment is optional.
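
The two expansions in the proposed text can be sketched as one routine with a flag. This is only an illustration under stated assumptions: the names expand_range and _increment are hypothetical, and the reading that a byte wraps from 255 to 1 (rather than 0) in the single-shift case is inferred from the s\d130\d01 example above.

```python
def _increment(code, low):
    """Add one to a big-endian byte list in place, wrapping bytes to `low`."""
    i = len(code) - 1
    while i >= 0:
        if code[i] < 255:
            code[i] += 1
            return
        code[i] = low  # wrap this byte and carry into the next one up
        i -= 1
    raise OverflowError("range does not fit in the given byte width")


def expand_range(first, count, single_shift=False):
    """Expand a range; single-shift ranges never produce a zero byte on wrap."""
    low = 1 if single_shift else 0
    current = list(first)
    codes = []
    for n in range(count):
        if n > 0:
            _increment(current, low)
        codes.append(list(current))
    return codes


wide = expand_range([129, 254], 4)
shifted = expand_range([129, 254], 4, single_shift=True)
```

Here wide ends with [130, 0] and [130, 1], while shifted ends with [130, 1] and [130, 2], matching the two interpretations listed above.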

Rationale:

Providing notations for single chars and ranges is a "good idea", IMO, and this 
gives equal flexibility to wide chars and single shift chars in using either 
method, keeping in mind a <nul> is not a valid element of a shifted char (which 
I think is another good idea). As support for wide characters is required 
because of the wide character interfaces, it is chosen as the default and the 
shift-state encodings, being wholly optional but permitted specifically, are 
required to indicate with the 's' they are being defined and need the special 
handling. The 's' could also be appended, as in "\d129\d254s", with the 
wording adjusted accordingly, but I would think this would slow a parser down.

The alternative, as far as I can see, is to insert verbiage somewhere to 
specifically prohibit shift-state char ranges from passing 255 as any part of 
their encoding, or from using range notation altogether, to go with the 
limitation that a single char definition will not have a nul as any of its 
bytes. I'm against this, as it would bloat the definition files for some 
charsets unnecessarily, and it doesn't solve the problem of whether the 
implementation should store the value as successive bytes in the order given 
or in whatever way it stores wide char values.
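
For comparison, the prohibition described in this alternative would amount to a check along the following lines. The function name and the idea that the number of names in the range is already known (for instance, 4 for <j0101>...<j0104>) are assumptions for the sake of the sketch.

```python
def range_crosses_byte_boundary(first, count):
    """True if expanding `count` names from `first` carries out of the last byte."""
    return first[-1] + (count - 1) > 255
```

Under such a rule a charmap author would have to split <j0101>...<j0104> s\d129\d254 into two range lines, which is the file bloat objected to above.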

A third one could be to require wide char definitions to begin with a leading 
<nul> part that would be stripped off and not used in determining the byte width 
of the char, just as a signal that a wide char is being defined. This bloats 
chardef files that do define wide charsets, however.

Future Directions: 

This is stuff that I think would be pretty easy to implement from a parsing 
and syntax analysis standpoint, given the current standard and this fix as a 
base, and that I wouldn't mind seeing fill out this chapter a bit more. 
Conceptually, I view these ideas as rounding out the combinatorial 
possibilities of the syntax elements already present, leaving locking codes 
as implementation-defined.

Adding keywords that would allow shifted chars to be mapped to wide char 
equivalents in the chardef file is potentially useful. The standard already 
specifies a basic mapping between the byte and wide "POSIX" char sets, and this 
could generalize it so it could be portable to other charsets as well. For 
flexibility, syntax could be added that allows referencing external chardef 
files, so a single file that specifies a large wide charset could be subindexed 
by smaller byte width chardef files.

Another possibility is allowing, for example, a "Sconst,const[,const...]" or 
"const,const[,const...]S" construct to specify single-shifted wide 
characters, and extending the mapping facility onto a range, or ranges, of 
single-shifted byte chars in the same or another file. Something like this 
could be useful to applications that work with larger subsets of Unicode-32, 
where a few shift-state codes could prevent a chardef from jumping to needing 
a three byte width versus two for wide char storage.

A further possibility is providing a syntax that allows previously defined 
ranges of unshifted wide chars to specify a range of shifted byte values, and 
vice versa, with the back and forth mapping implicit. To ease this, a means 
of symbolically aliasing groups of ranges could also be implemented, perhaps 
usable in constructing LC_CTYPE definitions in localedef files.
