Email List: austin-group-l

Re: multibyte C locale

To: wollman+austin-group@xxxxxxxxxxx
Subject: Re: multibyte C locale
From: Albert Cahalan <albert@xxxxxxxxxxxxxxxxxxxxx>
Date: Sat, 31 Oct 2009 06:10:37 -0400
Cc: Glen Seeds <Glen.Seeds@xxxxxxxxxx>, austin-group-l@xxxxxxxxxxxxx
On Fri, Oct 30, 2009 at 6:03 PM,  <wollman+austin-group@xxxxxx> wrote:
> <<On Fri, 30 Oct 2009 17:46:54 -0400, Glen Seeds <Glen.Seeds@xxxxxx> said:

>>> [I wrote:]
>>> Turn it around: of what value is the POSIX locale without such a
>>> requirement?
>
>> The value is considerable, if we can find a way to accommodate UTF-8.
>
> I don't see it.  What's the use case?  I can see a value to
> applications in being able to configure a locale that acts like
> pre-locale Unix and C did; I don't see a value in configuring a locale
> that doesn't behave like traditional Unix, but isn't usefully
> localized either.  ("Traditional Unix" behavior is normally what I
> want pretty much all the time.)

C.UTF-8 would be damn nice. Dealing with UTF-8 text is rather
important these days. Unfortunately, locales like en_US.UTF-8
get all stupid with collating order. 'a' should come after 'Z', damn it!
(That is, U+0061 is a bigger number than U+005A.) Possibly
there is some hack with multiple locale variables that will make
things sane, but that's excessively painful for such a common case.
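
For illustration, one form the "multiple locale variables" hack could take is to set the locale categories separately: a UTF-8 locale for LC_CTYPE and the C locale for LC_COLLATE. This is only a sketch under assumptions. The helper names are hypothetical, and "en_US.UTF-8" is an assumed locale name that may not be installed on a given system.

```c
#include <locale.h>
#include <string.h>

/* Hypothetical helper: ask for UTF-8 character handling while keeping
 * C-locale (byte-value) collation.
 * Returns 1 if the UTF-8 LC_CTYPE locale was available, 0 otherwise. */
int use_utf8_ctype_with_c_collation(void)
{
    /* Start from the plain C locale for every category. */
    setlocale(LC_ALL, "C");

    /* "en_US.UTF-8" is an assumed name; setlocale() returns NULL
     * and leaves LC_CTYPE unchanged if it is not installed. */
    int have_utf8 = setlocale(LC_CTYPE, "en_US.UTF-8") != NULL;

    /* Keep collation in the C locale regardless of LC_CTYPE. */
    setlocale(LC_COLLATE, "C");

    return have_utf8;
}

/* Hypothetical helper: nonzero if a sorts strictly before b
 * under the current LC_COLLATE setting. */
int sorts_before(const char *a, const char *b)
{
    return strcoll(a, b) < 0;
}
```

After calling use_utf8_ctype_with_c_collation(), sorts_before("Z", "a") should be nonzero, since 'Z' (0x5A) has a smaller byte value than 'a' (0x61). From a shell, the roughly equivalent setup would be LC_CTYPE=en_US.UTF-8 together with LC_COLLATE=C, with LC_ALL left unset.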

I expect the plain C locale to cover U+0000 through U+00FF,
but it does not. At least with glibc, stuff like U+00E0 fails the
isalpha() test. Ouch, this is broken too.
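
A minimal way to check the isalpha() behavior is sketched below. The helper name is hypothetical, and treating U+00E0 as the single byte 0xE0 assumes a Latin-1 style reading of the values U+0080 through U+00FF.

```c
#include <ctype.h>
#include <locale.h>

/* Hypothetical helper: nonzero if byte value c is classified as
 * alphabetic in the plain C locale. */
int c_locale_isalpha(int c)
{
    /* Force every category, including LC_CTYPE, to the C locale. */
    setlocale(LC_ALL, "C");

    /* isalpha() needs a value representable as unsigned char (or EOF). */
    return isalpha((unsigned char)c) != 0;
}
```

Here c_locale_isalpha('a') is nonzero while c_locale_isalpha(0xE0) is zero. ISO C restricts isalpha() in the "C" locale to the 26 uppercase and 26 lowercase basic Latin letters, so this result is not specific to glibc.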
