-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
According to Glenn Fowler on 10/30/2009 10:14 PM:
> I'd like a clarification of what is being discussed
>
> consider the file bytes.dat that contains all of the
> bytes from 0 through 255 inclusive and in order
>
> export LC_CTYPE=C
> wc -c bytes.dat
Required to output 256 - there are exactly that many bytes, regardless of
the encoding.
> wc -m bytes.dat
If successful, this is required to output at least 103 (the number of
entries in the portable character set), and at most 256 (the number of
bytes), but the result depends on the charset chosen for the C locale. It
is also acceptable for wc to fail with non-zero status and a message about
the fact that an invalid character sequence was encountered. The fact
that the file contains bytes that do not correspond to the portable
character set required of the POSIX locale means that this is no longer a
compliant environment, so the standard is silent on what the correct
answer will be, and a portable application should not be expecting a
particular answer for this usage.
> do the 2 wc's produce the same output or not?
They can, but don't have to.
> or does the answer depend on if LC_CTYPE=C uses a UTF-8 encoding?
It doesn't even matter whether LC_CTYPE=C uses a single-byte encoding. The
result of wc -m boils down to which byte sequences the locale defines as
characters, and the POSIX locale only guarantees the portable character set.
Furthermore, recall that historical Unix had only wc -c. wc -m was a
later invention of the committee, added because -c counts bytes, not
characters, while many users want to count characters. So the fact that
LC_ALL=C wc -m is undefined on bytes.dat is no change from historical
behavior.
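
For anyone who wants to try this on their own system, here is a rough
sketch (not a conformance test) of a script that builds bytes.dat as
described in the question and then runs both commands without assuming
what wc -m will report. The printf loop is just one way to create the
file, and the variable names bytes and chars are made up for the example.
It writes bytes.dat into the current directory.

```shell
# Sketch only: build bytes.dat with one byte for each value 0..255, in order.
: > bytes.dat
i=0
while [ "$i" -le 255 ]; do
    # Inner printf yields a 3-digit octal number; the outer printf
    # interprets \ooo in its format string and emits that single byte.
    printf "\\$(printf '%03o' "$i")" >> bytes.dat
    i=$((i + 1))
done

LC_ALL=C
export LC_ALL

# wc -c counts bytes, so this has to be 256 whatever the encoding.
# Reading from stdin avoids the file name in the output; $(( )) drops
# any padding that some wc implementations print.
bytes=$(( $(wc -c < bytes.dat) ))
echo "wc -c: $bytes"

# wc -m is allowed to fail here, or to succeed with a count that
# depends on the charset chosen for the C locale, so handle both.
if chars=$(wc -m < bytes.dat 2>/dev/null); then
    chars=$(( $chars ))
    echo "wc -m: $chars"
else
    chars=
    echo "wc -m: failed, possibly on an invalid character sequence"
fi
```

A portable script should treat the wc -m branch as informational only:
either outcome is acceptable, and when the command does succeed the count
can be anywhere from 103 to 256.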
- --
Don't work too hard, make some time for fun as well!
Eric Blake ebb9@byu.net
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Public key at home.comcast.net/~ericblake/eblake.gpg
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/
iEYEARECAAYFAkrrwkoACgkQ84KuGfSFAYBGUgCfQ70I48djBLXexBb16kWKCG6X
CX8AnRWWOviO9X5NS9VqFVUIuNrTJxk2
=cHqP
-----END PGP SIGNATURE-----