Defect report from: Mark Ziegast, SHware Systems
(Please direct followup comments to yyyyyyyyyyyyyy@xxxxxxxxxxxxx)
@ page 119 line 3824 section 6.4 objection {030224-01}
Problem:
Defect code: 2. Omission
Synopsis:
Incomplete specification of encoding requirements for single-shift state
charsets.
Background:
(from C950.pdf)
3842 In lines defining ranges of symbolic names, the encoded value shall be the
value for the first symbolic name in the range (the symbolic name preceding the
ellipsis). Subsequent symbolic names defined by the range shall have encoding
values in increasing order. Bytes shall be treated as unsigned octets, and
carry shall be propagated between the bytes as necessary to represent the
range. For example, the line:
3847 <j0101>...<j0104> \d129\d254
3848 is interpreted as:
3849 <j0101> \d129\d254
3850 <j0102> \d129\d255
3851 <j0103> \d130\d0
3852 <j0104> \d130\d1
3853 The comment is optional.
Also,
3885 6.4.1 State-Dependent Character Encodings
3886 This section addresses the use of state-dependent character encodings
(that is, those in which the encoding of a character is dependent on one or
more shift codes that may precede it). A single-shift encoding (where each
character not in the initial shift state is preceded by a shift code) can be
defined in the charmap format if each shift-code/character sequence is
considered a multi-byte character, defined using the concatenated-constant
format described in Section 6.4 (on page 117).
as illustrated above for a wide char range, and
3719 While in the initial shift state, all characters in the portable character
set shall retain their usual interpretation and shall not alter the shift
state. The interpretation for subsequent bytes in the sequence shall be a
function of the current shift state. A byte with all bits zero shall be
interpreted as the null character independent of shift state. Thus a byte with
all bits zero shall never occur in the second or subsequent bytes of a
character.
Problem:
As the expansion of the wide character range <j0101>...<j0104> above shows,
a multi-byte character definition is allowed to have a <nul> as a second or
subsequent byte. (The last two lines of that expansion are also missing their
leading zeroes, but that's editorial.) A single-shift encoding is expected
to use the same formats for individual chars and ranges of chars, yet by line
3719 it can NOT have a <nul> as a second or subsequent byte; i.e. for a range
that carries over a byte boundary the next legitimate value is \d01, not \d00.
There is currently no way to tell whether a given char or range declaration is
intended to define wide chars or single-shift chars. Nor is there any
requirement that stops an implementation declaring single-shift chars from
using the range notation when the charset it wishes to define has multiple
single-shift chars with consecutive byte values. Thus a parser could blithely
generate <nul>s, even though the chardef file otherwise looks perfectly
legitimate to it.
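
To make the failure concrete, here is a minimal sketch of how a parser might
expand a range under the current wording (lines 3842-3852). The function name
expand_range and the choice to treat the bytes as one big-endian unsigned
number are illustrative assumptions, not anything the standard prescribes;
they are just one straightforward way to get "carry propagated between the
bytes".

```python
def expand_range(first, count):
    """Expand a charmap range under the current wording.

    `first` is the encoding of the first symbolic name as a bytes object.
    The bytes are treated as a single big-endian unsigned number so that
    carry propagates between them (an assumed, not mandated, approach).
    """
    width = len(first)
    start = int.from_bytes(first, "big")
    return [(start + i).to_bytes(width, "big") for i in range(count)]


# <j0101>...<j0104> \d129\d254
encodings = expand_range(bytes([129, 254]), 4)
for enc in encodings:
    print("".join(f"\\d{b}" for b in enc))

# Line 3719: a zero byte shall never occur in the second or subsequent
# bytes of a character in a state-dependent encoding.
offenders = [enc for enc in encodings if 0 in enc[1:]]
```

For this range, offenders holds the third encoding (byte values 130 and 0),
which is exactly the value a single-shift charset must not contain.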
Action:
This is a bit wordy, in order to show adequate context.
Change lines 3824-3835 to read:
3824 The encoding part indicates the numeric code of an individual character
value or the first character of a range of values. For a single-byte character
value the code is specified as a decimal, octal, or hexadecimal constant in one
of the following formats:
3826 "%cd%u", <escape_char>, <decimal byte value>
3827 "%cx%x", <escape_char>, <hexadecimal byte value>
3828 "%c%o", <escape_char>, <octal byte value>
3829 Decimal constants shall be represented by two or three decimal digits,
preceded by the escape character and the lowercase letter ’d’; for example,
"\d05", "\d97", or "\d143". Hexadecimal constants shall be represented by two
hexadecimal digits, preceded by the escape character and the lowercase letter
’x’; for example, "\x05", "\x61", or "\x8f". Octal constants shall be
represented by two or three octal digits, preceded by the escape character; for
example, "\05", "\141", or "\217". In a portable charmap file, each constant
represents an 8-bit byte.
A multi-byte character value is encoded by concatenating the constants
describing the individual byte values. When encoding a single-shift state
character or range, the lowercase letter 's' shall be prepended to the combined
constant to distinguish it from the encoding for a wide character. When
constants... etc.
and Change lines 3842-3852 to read:
3842 In lines defining ranges of symbolic names, the encoded value shall be the
value for the first symbolic name in the range (the symbolic name preceding the
ellipsis). Subsequent symbolic names defined by the range shall have encoding
values in increasing order. Bytes shall be treated as unsigned octets, and
carry shall be propagated between the bytes as necessary to represent the
range, skipping the value "\d00" for a single-shift state character as any
partial constant. As examples:
the wide character range encoding <j0101>...<j0104> \d129\d254 is interpreted
as:
<j0101> \d129\d254
<j0102> \d129\d255
<j0103> \d130\d00
<j0104> \d130\d01
and the single-shift state encoding <j0101>...<j0104> s\d129\d254
is interpreted as:
<j0101> s\d129\d254
<j0102> s\d129\d255
<j0103> s\d130\d01
<j0104> s\d130\d02
3853 The comment is optional.
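
The proposed range rule can be sketched as follows. This is only an
illustration under stated assumptions: the helper names (parse_encoding,
next_encoding, format_encoding, expand) are hypothetical, only the decimal
"\dNNN" constant form is handled, and "skipping \d00" is read as "a byte that
wraps on carry restarts at 1 rather than 0 when the 's' prefix is present".

```python
import re


def parse_encoding(text):
    """Split e.g. r's\\d129\\d254' into (single_shift, [129, 254])."""
    single_shift = text.startswith("s")  # proposed 's' prefix
    body = text[1:] if single_shift else text
    # Only decimal constants are handled in this sketch.
    if not re.fullmatch(r"(?:\\d\d{2,3})+", body):
        raise ValueError(f"unsupported encoding: {text!r}")
    values = [int(v) for v in re.findall(r"\\d(\d{2,3})", body)]
    if any(v > 255 for v in values):
        raise ValueError(f"byte value out of range: {text!r}")
    return single_shift, values


def next_encoding(values, single_shift):
    """Return the next encoding in increasing order, with carry."""
    out = list(values)
    for i in reversed(range(len(out))):
        if out[i] < 255:
            out[i] += 1
            return out
        # This byte wraps and the carry moves left. In a single-shift
        # encoding the wrapped byte restarts at 1, skipping \d00.
        out[i] = 1 if single_shift else 0
    raise OverflowError("range exceeds the byte width of the encoding")


def format_encoding(values, single_shift):
    prefix = "s" if single_shift else ""
    return prefix + "".join(f"\\d{v:02d}" for v in values)


def expand(text, count):
    """Expand a range of `count` names starting at encoding `text`."""
    single_shift, values = parse_encoding(text)
    result = []
    for i in range(count):
        if i:
            values = next_encoding(values, single_shift)
        result.append(format_encoding(values, single_shift))
    return result


wide = expand(r"\d129\d254", 4)
shifted = expand(r"s\d129\d254", 4)
```

Under these assumptions, wide reproduces the four wide character lines given
above and shifted reproduces the four single-shift lines, with \d130\d01
following \d129\d255 directly.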
Rationale:
Providing notations for both single chars and ranges is a "good idea", IMO, and
this change gives wide chars and single-shift chars equal flexibility in using
either method, keeping in mind that a <nul> is not a valid element of a shifted
char (which I think is another good idea). Because support for wide characters
is required by the wide character interfaces, that form is chosen as the
default; the shift-state encodings, being wholly optional but specifically
permitted, are required to indicate with the 's' that they are being defined
and need the special handling. The 's' could instead be appended, as in
"\d129\d254s", with the text reworded accordingly, but I would think this would
slow a parser down.
The alternatives, as far as I can see, are to insert wording somewhere that
specifically prohibits shift-state char ranges from passing 255 in any part of
their encoding, or from using range notation altogether, to go with the
limitation that a single char definition will not have a <nul> as any of its
bytes. I'm against this, as it would bloat the definition files for some
charsets unnecessarily, and it doesn't solve the problem of whether the
implementation should store the value as successive bytes in the order given or
in whatever way it stores wide char values.
A third option would be to require wide char definitions to begin with a
leading <nul> part that would be stripped off and not used in determining the
byte width of the char, serving only as a signal that a wide char is being
defined. This bloats chardef files that do define wide charsets, however.
Future Directions:
These are things I think would be fairly easy to implement from a parsing and
syntax analysis standpoint, given the current standard and this fix as a base,
and that I wouldn't mind seeing fill out this chapter a bit more. Conceptually,
I view them as rounding out the combinatorial possibilities of the syntax
elements already present, leaving locking codes as implementation-defined.
Adding keywords that would allow shifted chars to be mapped to wide char
equivalents in the chardef file is potentially useful. The standard already
specifies a basic mapping between the byte and wide "POSIX" char sets, and this
could generalize it so it could be portable to other charsets as well. For
flexibility, syntax could be added that allows referencing external chardef
files, so a single file that specifies a large wide charset could be subindexed
by smaller byte width chardef files.
Allowing, for example, a "Sconst,const[,const...]" or "const,const[,const...]S"
construct to specify single-shifted wide characters, and extending the mapping
facility onto a range, or ranges, of single-shifted byte chars in the same or
another file. Something like this could be useful to applications that work
with larger subsets of Unicode-32, where a few shift-state codes could prevent
a chardef from jumping to needing a three byte width versus two for wide char
storage.
Providing a syntax that allows previously defined ranges of unshifted wide
chars to specify a range of shifted byte values, and vice versa, with the back
and forth mapping implicit. To ease this, a means of symbolically aliasing
groups of ranges could also be implemented, which might also be usable in
constructing LC_CTYPE definitions in localedef files.