Email List: Xaustin-group-lX

Re: AI 2000-05-010: proposed interface

To: yyyyyyy@xxxxxxxxxx (Ulrich Drepper)
Subject: Re: AI 2000-05-010: proposed interface
From: "Sandra O'donnell USG" <yyyyyyyy@xxxxxxxxxxx>
Date: Wed, 16 Aug 2000 10:37:13 -0400
Cc: yyyyyyyyyyyy@xxxxxxxxxxxxx, yyyyyyyy@xxxxxxxxxxx
   . . .
   > Are you one of those who might use this interface?
   
   No.  I'm writing the libc and have access to the internal data
   structures.

So which users have requested this functionality? Although I agree
there's no easy way to get the collation sequence information today,
who needs this? To do what?
   
   > If so, would you expect to produce significantly different output
   > for your regular expressions than is typical for users of POSIX
   > internationalized OSes?
   
   I don't know.  All the Unices we have (means all of them) are crippled
   by being the US-only version.  We don't pay extra to get the language
   extensions and therefore I have no idea what other OSes do.
   
Most of the major Unix vendors ship systems with many locales
already included. You may have to pay more for large fonts, some
translations, etc. (each vendor varies, of course), but I'm surprised
that you are only seeing US stuff.

   > I'm asking because I remember you were concerned about the existing
   > behavior of internationalized reg ex's, because you didn't want
   > (among other things) case-folding. That is, you didn't want a
   > range like [a-c] to match
   > 
   > a A b B c
   
   Right.  And this is addressed meanwhile by using this kind of
   information.
   
How do the APIs address this? Assume the sequence is defined
in a case-mixed way, the APIs exist, and a program uses them
to access the sequence. Now what?

   . . .
   > Will collation *order* and collation *sequence* truly be different
   > things? Will there be any localedef syntax changes?
   
   No localedef syntax changes.  I'm just emitting another table.  And
   yes, the order and the sequence are very different.  Look at the
   LC_COLLATE specification in ISO 14651 (which is actually a standard)
   
First, the JTC1/SC22/WG20 Web site I've looked at says ISO 14651 is still in
the third FCD, not that it has been finalized. Either way, its status
is irrelevant. There are hundreds of locales on existing implementations,
and they won't all go away because 14651 exists. Any proposed APIs must
deal with the reality of sequences and orders defined in these existing
locales.

However, if I look at the sequence in 14651 as an example, I see this:

. . .
<U0061> <S0061>;<BASE>;<MIN>;<U0061> % LATIN SMALL LETTER A
<UFF41> <S0061>;<BASE>;<WIDE>;<UFF41> % FULLWIDTH LATIN SMALL LETTER A
<U249C> <S0061>;<BASE>;<COMPAT>;<U249C> % PARENTHESIZED LATIN SMALL LETTER A
<U24D0> <S0061>;<BASE>;<CIRCLE>;<U24D0> % CIRCLED LATIN SMALL LETTER A
<U0041> <S0061>;<BASE>;<CAP>;<U0041> % LATIN CAPITAL LETTER A
<UFF21> <S0061>;<BASE>;<WIDECAP>;<UFF21> % FULLWIDTH LATIN CAPITAL LETTER A
<U24B6> <S0061>;<BASE>;<CIRCLECAP>;<U24B6> % CIRCLED LATIN CAPITAL LETTER A
<U00AA> <S0061>;<BASE>;<MNN>;<U00AA> % FEMININE ORDINAL INDICATOR
<U00E1> <S0061>;"<BASE><AIGUT>";"<MIN><MIN>";<U00E1> % LATIN SMALL LETTER A WITH ACUTE
<U00C1> <S0061>;"<BASE><AIGUT>";"<CAP><MIN>";<U00C1> % LATIN CAPITAL LETTER A WITH ACUTE
<U00E0> <S0061>;"<BASE><GRAVE>";"<MIN><MIN>";<U00E0> % LATIN SMALL LETTER A WITH GRAVE
<U00C0> <S0061>;"<BASE><GRAVE>";"<CAP><MIN>";<U00C0> % LATIN CAPITAL LETTER A WITH GRAVE
. . .
many other A's-with-diacritics
. . .
<U0062> <S0062>;<BASE>;<MIN>;<U0062> % LATIN SMALL LETTER B
<UFF42> <S0062>;<BASE>;<WIDE>;<UFF42> % FULLWIDTH LATIN SMALL LETTER B
<U249D> <S0062>;<BASE>;<COMPAT>;<U249D> % PARENTHESIZED LATIN SMALL LETTER B
<U24D1> <S0062>;<BASE>;<CIRCLE>;<U24D1> % CIRCLED LATIN SMALL LETTER B
<U0042> <S0062>;<BASE>;<CAP>;<U0042> % LATIN CAPITAL LETTER B
. . .
many other B's-with-diacritics
. . .

The *sequence* of these characters is such that all versions of A's
come before all versions of B's which come before all versions of
C's, and so on. Thus, in the sequence, lowercase and uppercase
are intermixed.

The *order* is that lowercase (<MIN>) comes before uppercase (<CAP>),
and other weights (<GRAVE>,<AIGUT>,<WIDECAP>,etc.) apply as specified,
but an uppercase A still comes before a lowercase b. How does either
the sequence or the order as defined in 14651 address your perceived
problem with respect to [a-c] matching "a A b B c"?
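
To make the question concrete, here is a minimal sketch in C. It assumes a
range is evaluated purely by position in the collation sequence, and it
contains only the code points from the excerpt quoted above (the elided
entries are left out). The names excerpt_seq, seq_pos, and in_seq_range are
hypothetical and are not part of any proposed interface.

```c
#include <stddef.h>
#include <stdint.h>

/* Code points in the order they appear in the ISO 14651 excerpt quoted
   above. Entries the excerpt elides are omitted, so this table is
   illustrative only. */
const uint32_t excerpt_seq[] = {
    0x0061, 0xFF41, 0x249C, 0x24D0,  /* a, fullwidth, parenthesized, circled */
    0x0041, 0xFF21, 0x24B6, 0x00AA,  /* A, fullwidth, circled, feminine ordinal */
    0x00E1, 0x00C1, 0x00E0, 0x00C0,  /* a/A with acute, a/A with grave */
    0x0062, 0xFF42, 0x249D, 0x24D1,  /* b, fullwidth, parenthesized, circled */
    0x0042                           /* B */
};

#define EXCERPT_LEN (sizeof excerpt_seq / sizeof excerpt_seq[0])

/* Hypothetical helper: position of cp in the excerpt, or -1 if absent. */
int seq_pos(uint32_t cp)
{
    for (size_t i = 0; i < EXCERPT_LEN; i++) {
        if (excerpt_seq[i] == cp)
            return (int)i;
    }
    return -1;
}

/* Hypothetical helper: nonzero if cp lies between lo and hi (inclusive)
   when all three are compared by position in the sequence. */
int in_seq_range(uint32_t lo, uint32_t hi, uint32_t cp)
{
    int l = seq_pos(lo);
    int h = seq_pos(hi);
    int p = seq_pos(cp);

    if (l < 0 || h < 0 || p < 0)
        return 0;
    return l <= p && p <= h;
}
```

Under these assumptions, in_seq_range(0x0061, 0x0062, 0x0041) is nonzero:
uppercase A falls inside [a-b] because it sits between a and b in the
sequence. Uppercase B (0x0042) falls outside only because it follows b.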

   > There are no interfaces for directly retrieving information in other
   > parts of the locale. For example, there are no APIs for getting the
   > info in the LC_CTYPE section.
   
   Sure there is.  You can call the is*() and to*() functions for each
   and every character.

Aaah, we have different definitions of "retrieving information." 
I mean (and thought you meant) that there are no APIs for listing
the values in the LC_CTYPE section. For LC_TIME, LC_MONETARY, etc.,
you can call localeconv() or nl_langinfo() to retrieve the values
defined in these sections. For example, you can get the list of
month names as defined in the locale. There are no APIs, however, that
return the list of characters defined as alpha in a given locale. You
can pass a character to is[w]alpha() and ask if it is alpha in that
locale, but that's different. At least in my mind.
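
A rough C sketch of the difference, assuming a POSIX system with
<langinfo.h>. The month names come straight back from nl_langinfo(), while
the alpha set has to be rebuilt by testing every candidate character. The
helper names month_name and collect_alpha are hypothetical, and the loop
covers single-byte characters only.

```c
#include <ctype.h>
#include <langinfo.h>
#include <limits.h>
#include <stddef.h>

/* POSIX does not promise that MON_1..MON_12 are consecutive values,
   so list them explicitly. */
static const nl_item month_items[12] = {
    MON_1, MON_2, MON_3, MON_4,  MON_5,  MON_6,
    MON_7, MON_8, MON_9, MON_10, MON_11, MON_12
};

/* Direct retrieval: the locale's name for month index 0..11,
   or NULL if the index is out of range. */
const char *month_name(int index)
{
    if (index < 0 || index >= 12)
        return NULL;
    return nl_langinfo(month_items[index]);
}

/* No direct retrieval: probe every unsigned char value with isalpha().
   Stores up to cap matches in out and returns the total number found. */
size_t collect_alpha(unsigned char *out, size_t cap)
{
    size_t n = 0;

    for (int c = 0; c <= UCHAR_MAX; c++) {
        if (isalpha(c)) {
            if (n < cap)
                out[n] = (unsigned char)c;
            n++;
        }
    }
    return n;
}
```

Both helpers reflect whatever locale is current, so a program would
normally call setlocale(LC_ALL, "") first. For wide characters the same
probing approach with iswalpha() would mean walking the whole wchar_t
range, which is the asymmetry described above.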

                -- Sandra
-----------------------
Sandra Martin O'Donnell
Compaq Computer Corporation
yyyyyyyyyyyyyyy@xxxxxxxxxx
yyyyyyyy@xxxxxxxxxxx
