OSF DCE SIG                                        A. Thormodsen (HP)
Request For Comments: 13.0                                August 1992

DCE 1.1 INTERNATIONALIZATION REQUIREMENTS

1. INTRODUCTION AND SUMMARY

This paper explains in detail the high-priority internationalization requirements for the OSF DCE 1.1, as recently determined by the DCE SIG. It also provides some background on the motivations behind these requirements and on the relationship between these requirements and the generic internationalization requirements for all OSF components.

1.1. Summary of Requirements

The internationalization requirements on the DCE 1.1 are in two groups: mandatory base-level requirements and DCE-specific requirements.

The base-level requirements are not prioritized. All must be met to bring the DCE to a minimum level of internationalization. These are:

(a) REQT: 8-bit/multibyte "clean", no corruption of non-ASCII data, no unnecessary restrictions of text data to ASCII.
(b) REQT: XPG-3-style message catalog support included for all user-visible text. Messages should be in a common, agreed upon, format and default messages should be supplied for ease of serviceability.
(c) REQT: Support for internationalization functionality provided with a single source/single binary model.
(d) REQT: DCE components individually tested both for functionality and performance with non-ASCII, and in particular multibyte, character data.

The DCE 1.1 specific requirements, as prioritized by the DCE SIG, are:

(a) REQT: Homogeneous Interoperability (one language/char set/encoding)
(b) REQT: Follow Internationalization Standards
(c) REQT: Provide a Portable Character Set

Thormodsen                                                     Page 1
DCE-RFC 13.0          DCE 1.1 I18N Requirements           August 1992

(d) REQT: Support Standard Locales
(e) REQT: Support a Universal Character Set and Encoding
(f) REQT: Support Character Set and Encoding Independence

Based on OSF's investigation of the DCE source code (see [Ogur]), it is apparent that the base-level requirements have not yet been met.
It is particularly important that the message catalogs and testing requirements be met, since without these it is not clear that the DCE will be acceptable in the international marketplace.

2. BACKGROUND

Until recently, support for more than one language or character set and encoding in an operating system or application was regarded as an "exotic" requirement. This is no longer true. The computer industry, and its customers, have become global enterprises. Most large software systems are sold world-wide today; in fact, they must be in order to return enough profit.

The goal of world-wide sales has often been accomplished by designing multiple systems, one for each country or language. A more efficient approach, however, is to provide support for multiple languages, character sets and encodings within one software system. This approach is what is meant by "internationalization".

The DCE presents some unique challenges for internationalization. The DCE is intended to allow multiple computer systems to efficiently interoperate. From the standpoint of internationalization there are at least four possible generic situations encountered in a network of interoperating computer systems:

(a) All systems share the same character set, encoding (single or multibyte) and language.
(b) Systems use different character sets or encodings to represent the same language.
(c) Systems use the same character set and encoding to represent different languages.
(d) Systems use different character sets or encodings to represent different languages.

The technology to address some of the issues raised by the situations described above will not be available for years. Other issues, especially those posed by the first two situations, can be, and need to be, addressed today.
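As a rough illustration of situation (b), the following Python sketch shows the same text producing different byte sequences on systems that use different encodings. It is an illustration only and not part of the DCE: the codecs of the Python standard library stand in for the encodings of two hypothetical systems, and the helper name encode_on is invented here.

```python
# Illustration only: Python's standard codecs stand in for the character
# sets and encodings of hypothetical systems on one network.

def encode_on(system_codec, text):
    """Return the bytes a system using `system_codec` would store for `text`."""
    return text.encode(system_codec)


# The same three letters on an ASCII system and on a U.S. EBCDIC (cp037) system.
ascii_bytes = encode_on("ascii", "ABC")
ebcdic_bytes = encode_on("cp037", "ABC")

# The same Japanese word (the three characters of "nihongo", written with
# escapes) in two different encodings of the JIS character set.
japanese_word = "\u65e5\u672c\u8a9e"
euc_bytes = encode_on("euc_jp", japanese_word)
sjis_bytes = encode_on("shift_jis", japanese_word)

# Bytes interpreted under the wrong encoding do not yield the original text.
misread = ebcdic_bytes.decode("latin_1")

if __name__ == "__main__":
    print(ascii_bytes, ebcdic_bytes)
    print(euc_bytes != sjis_bytes)
    print(misread == "ABC")
```

The text is the same in each pair, but a receiver that does not know which encoding the sender used cannot interpret the bytes reliably. That is the problem the later requirements on portable characters and interchange encodings address.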
The requirements discussed in this paper are motivated primarily by the first situation above, in which all systems under consideration are operating using the same (human) language, character set and encoding. (NOTE: It is also assumed in this case that all networked systems are running in the same locale; this assumption is discussed in more detail in section 3 below.)

The second situation, in which the same language is represented with different character sets, is partially addressed by the requirements discussed in this paper, but it isn't anticipated that a full solution to this problem will be needed until after the DCE 1.1 timeframe. The "portable character set" and "universal character set" requirements (see 4(c) and 4(e)) will provide a framework to allow an application developed under DCE 1.1 to function in a network which supports different character sets and encodings. The cost is a slightly more complex application design.

The last two situations described above present unique difficulties which are not presented by the first two. In particular, the DCE will need to deal meaningfully with data from different systems which may be using completely different linguistic and cultural assumptions about character data handling. As a specific example, imagine a DCE application attempting to merge collated data from multiple systems, each of which is using a fundamentally different collation order.

Despite the difficulties, some form of multilingual support will almost certainly be needed in the future, especially in economically important regions such as Western Europe. Today's DCE designs should not make assumptions which would prevent multilingual support in the future.

Appendix B contains a much more in-depth report on the issues surrounding data interchange in an internationalized DCE environment.

3.
REQUIRED BASE-LEVEL FUNCTIONALITY

It is assumed that the DCE components will provide certain minimal internationalization functionality as a matter of course. These requirements are:

(a) REQT: 8-bit/multibyte "clean", no corruption of non-ASCII data, no unnecessary restrictions of text data to ASCII.
(b) REQT: XPG-3-style message catalog support included for all user-visible text. Messages should be in a common, agreed upon, format and default messages should be supplied for ease of serviceability. (Recently messaging has become a more critical issue due to concerns about serviceability in a networked environment. A particular concern is the tracing and logging of messages arriving from various remote sites, possibly using different languages and character sets or encodings. This is discussed in more detail in Section 6.)
(c) REQT: Support for internationalization functionality provided with a single source/single binary model.
(d) REQT: DCE components individually tested both for functionality and performance with non-ASCII, and in particular multibyte, character data.

These requirements are derived from [Klin].

There is also an important requirement on the locale data used by the various systems connected via the DCE:

(a) REQT: The NLS locale data on various systems connected via the DCE must be consistent. Currently this must be done "manually", presumably by system administrators.

This requirement is necessary because this is the only method currently available to ensure that character data is handled the same on all DCE networked systems. In practice, this requirement would probably only be important for applications which do extensive, distributed, text processing (such as so-called "groupware" applications). In this case the application installation process itself would probably specify, or even provide, synchronized locales.
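One way an administrator might carry out the "manual" consistency check described above is to compute a digest of each host's locale data and compare the results. The Python sketch below is only an illustration under stated assumptions: the function name locale_fingerprint is invented here, the digest covers only the numeric and monetary conventions plus the collation order of a small sample, and nothing of this kind is specified by the DCE.

```python
import hashlib
import json
import locale


def locale_fingerprint(locale_name, sample):
    """Return a digest of a locale's conventions and its collation of `sample`.

    Hypothetical helper: two hosts whose fingerprints differ for the same
    locale name do not handle character data identically.
    """
    saved = locale.setlocale(locale.LC_ALL)  # query the current setting
    try:
        locale.setlocale(locale.LC_ALL, locale_name)
        summary = {
            # Numeric and monetary formatting conventions of the locale.
            "conventions": locale.localeconv(),
            # The order in which this locale collates the sample strings.
            "collation": sorted(sample, key=locale.strxfrm),
        }
    finally:
        locale.setlocale(locale.LC_ALL, saved)  # restore the caller's locale
    encoded = json.dumps(summary, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


if __name__ == "__main__":
    # Run on each host and compare the printed values by hand.
    print(locale_fingerprint("C", ["b", "a", "B"]))
```

A real check would need to cover more of the locale (character classification, date formats, message language), so a matching digest from this sketch is a necessary condition for consistency rather than a sufficient one.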
In the future it is anticipated that the DCE itself may be able to resolve the complexities of locales in a distributed environment. Currently the technology to do this does not exist, although X/Open-UNIFORUM is currently examining and addressing some of these requirements.

4. HIGH PRIORITY DCE 1.1 REQUIREMENTS

Below are the top six internationalization requirements for DCE 1.1, in priority order, with very brief descriptions. The entire prioritized list of requirements, with full descriptions, can be found in Appendix A. Note that some of the requirements in the list below have bearing on future plans; these can be deduced from the lower-prioritized requirements in that Appendix. In particular, several are prerequisites for regional heterogeneous interoperability, which ranked just below the items listed here.

(a) REQT: Homogeneous Interoperability (i.e., one language/char set/encoding)
    The DCE must be able to support networks in which all clients and servers are using the same character set and encoding.

(b) REQT: Follow Internationalization Standards
    The DCE shall follow all relevant formal standards in providing I18N functionality.

(c) REQT: Provide a Portable Character Set
    OSF should identify, publish, implement, and promote the use of a collection of "portable characters" which may be used by any DCE application (see B.2 for a definition of a portable character set).

(d) REQT: Support Standard Locales
    The DCE should support standard locales, if/when available, from these groups: ISO (highest priority), POSIX, X/OPEN.

(e) REQT: Support a Universal Character Set and Encoding
    OSF should support a single, universal character set and encoding which may be used by any DCE component or application.
(f) REQT: Support Character Set and Encoding Independence
    The DCE should be able to handle a wide variety of character sets and encoding methods, at a very minimum the character sets and encodings supported by OSF 1.1.

5. DISCUSSION OF REQUIREMENTS

The six internationalization requirements which were given highest priority by the SIG form an interdependent set. Each requirement has implications which affect other requirements, as well as affecting the DCE in ways not directly related to any of the requirements. This section discusses the various implications of each requirement.

5.1. REQT: Homogeneous Interoperability (i.e., One Language/Char Set/Encoding)

This is really the "master" requirement for DCE 1.1. Currently most computer networks share a common character set and encoding, and all users on a network share a common language. This requirement implies that an application developed with the DCE should function equally well when running on any such network. For example, an internationalized application developed with the DCE should function both on a Japanese network using EUC encoded JIS, and on a French network using ISO 8859-1. (NOTE: "internationalized" implies that the application will not need to be relinked, recompiled or rewritten to work in these different environments.)

Beyond the technical implications for the various DCE components, this requirement is directly dependent on at least three other DCE I18N requirements, which are briefly discussed below. These requirements are also discussed in detail in sections 5.3, 5.4 and 5.5.

(a) Provision of a portable character set
    It will be necessary for a DCE application to name and identify objects, especially "DCE owned" objects, regardless of the particular character set and encoding in use by the application and the network. This implies the existence of a "portable character set".
(b) Support for standard locales
    It will be necessary for a DCE application consistently to perform character manipulations, data formatting, and similar locale-dependent operations. This implies the use of the same locale throughout a DCE-supporting network. For those cases where standard locales exist, it is anticipated that they will be used on such a network. Therefore they must be supported by the DCE.

(c) Support character set and encoding independence
    This requirement is obvious from the description of "homogeneous interoperability" given above. A single, compiled, DCE application should be able to support any character set which the underlying system supports. This implies the use of character set/encoding independent interfaces such as the proposed XPG4 WPI within DCE components (see next item below for more discussion).

5.2. REQT: Follow Internationalization Standards

This is a "good citizenship" requirement. Basically it is requesting that all DCE components provide messaging, international character data processing, local conventions and collations via standard interfaces and using standard data where available. The purpose of this request is twofold: First, it ensures that the DCE components are fully internationalized and second, it ensures that an application which is developed in conformance with international standards can be easily ported to operate under the DCE.

An additional concern is the availability of standards-based interfaces on systems which the DCE will be ported to. Unfortunately not all systems can be guaranteed to support all standards-based interfaces. It may be necessary to adopt a strategy similar to that used by the X Consortium for X-Windows. In this case the X-Windows source is distributed along with a minimally functional set of certain standard interfaces, such as the XPG4 WPI wide character interfaces.
This facilitates rapid porting of prototype implementations. In particular, OSF should investigate what standards-based routines it already "owns" and how these might be bundled with the DCE.

5.3. REQT: Provide a Portable Character Set

The DCE currently specifies a "portable character set" (see section B.2) without further specifying the use and implementation of this set. This is a serious flaw; further specification is needed from OSF.

Most importantly, the DCE specifications should indicate explicitly what purposes this portable character set should be used for. The existing APIs in the DCE components need to be classified as to whether they are restricted to use of the portable characters, a subset of these characters, or a superset.

Furthermore, the specifications must be clarified with regard to encodings. It is the opinion of the Working Group that the DCE must specify one preferred encoding of these characters (presumably the ISO 646 IRV), while also supporting alternative, vendor-proprietary, sets on homogeneous networks. Note that a mandatory encoding of the PCS is not being requested here, only a preferred default encoding.

This requirement will clarify what is required to allow multivendor interoperability in a DCE network. It is not a requirement that the DCE support the simultaneous use of different encodings of the portable characters within one network. However, by specifying one standard default encoding the DCE can enable applications to interoperate, if they choose this encoding.

5.4. REQT: Support Standard Locales

The principal motivation for this requirement is discussed in 5.1 above. For a distributed application to behave in a consistent way (probably for it to be usable at all) it must have access to the same character handling behavior and data formatting information on all systems. This information comes from the locale data. As mentioned in Section 3, it is the system administrators' responsibility to make sure that this data is consistent.
Presumably, they will do this by resorting to some set of standard locales.

Hopefully X/Open (currently in unofficial cooperation with ISO) will have a database of standard locales available before the end of this year. If not, OSF itself will need to supply such locales. The only requirement on the DCE is that it be TESTED with these locales to ensure that homogeneous interoperability can actually be achieved.

5.5. REQT: Support a Universal Character Set/Encoding

This requirement is a special case of the more general requirement for character set independence stated in 5.6 below. A universal character set/encoding (UCS) is capable of representing the writing systems of a large number of languages within one set. The most obvious current choice for a UCS is ISO 10646. While not all of the possible uses of such a set are apparent, there are some obvious applications:

(a) A UCS can provide a convenient method of encoding characters for data interchange. This could be useful when communicating with systems using an unknown character set/encoding, or for providing data in a universal format to systems supporting various sets.
(b) A UCS can provide the basis for multilingual applications. Some application developers are already considering the use of ISO 10646 (or the related UNICODE) for various personal computer applications such as multimedia mailers. The DCE should be designed to permit such applications to interoperate via the DCE.

It is less important that the DCE use a UCS for internal purposes, such as object naming, but it should not prevent such use. The ISO 10646 standard is likely to affect other standards, such as X.500. The DCE needs to track any such changes.

More aspects of the implementation and use of such a set are discussed at length in Appendix B.

5.6. REQT: Support Character Set and Encoding Independence

This requirement is also discussed under item 5.1 above.
The same requirement holds for any internationalized software: that it be able to support a variety of character sets and encodings with a single compiled version. This implies that the DCE components must use internationalized interfaces, such as those specified in the proposed XPG4 WPI, to do all character handling. Furthermore, the DCE components should be designed in an internationalized way, without the use of hard-coded character constants and strings.

6. OTHER CONSIDERATIONS

During the course of discussing this paper some issues came up which are related to DCE 1.1 requirements, but not explicitly tied to them. These are mentioned below, with no implied prioritization.

6.1. Messaging

The 1.1 requirements above discuss message catalogs, which provide a way to bind an application to native language messages on a particular system. This model does not support the type of distributed environment implied by the DCE. In particular, it doesn't address the problem of servers needing to provide text messages (such as error reports) to clients which may be working in different languages.

There needs to be a way of referencing a message independently of the language and character set of the message. This can be accomplished via unique identifiers embedded within localizable messages, or perhaps through some more advanced approach. Note that there are two slightly different uses of such an identifier: one is to identify a message which has been received but cannot be interpreted; the other is to communicate a message to a system which is working in an unknown locale. As networks grow larger this serviceability issue will become critical.

6.2. Distributed Locales

At several points in the above discussion, it was necessary to assume homogeneous locales across a network. This is a fairly strict requirement, and may not easily be met.
It would be better if the DCE could somehow synchronize locales between clients and servers. The fundamental technology to implement such a system is currently being investigated by X/Open; these efforts should be monitored by the DCE developers.

APPENDIX A. COMPLETE LIST OF DCE 1.1 I18N REQUIREMENTS

Below is the complete list of DCE 1.1 internationalization requirements as voted on by the DCE SIG in November, 1991. The wording of these requirements is as they were voted on. In certain cases (notably portable characters) the exact specifications have changed slightly during further discussions. Refer to the main paper for more precise explanations.

(a) REQT: Homogeneous Interoperability (one language/char set/encoding)
    The DCE must be able to support networks in which all clients and servers are using the same character set and encoding. No assumptions about this specific character set and encoding should be made beyond the assumption of a consistent encoding of the Portable Character Set. In this configuration the user should not be able to distinguish between remote and local data access. (NOTE: An implication of this requirement is that identical locales must be available on all systems in a DCE network, and that all processes communicating via the DCE must be running in identical locales, if consistent character and data processing is required.)

(b) REQT: Follow Internationalization Standards
    The DCE shall follow formal standards in providing the following functionality: message catalogs, international character data processing, local conventions, collation. The recommended OSF prioritization of standards shall be applied in cases of conflicting standards. Refer to the OSF I18N SIG requirements list for further explanation of the relevant standards.
(c) REQT: Provide a Portable Character Set
    OSF should identify, publish, implement, and promote the use of a collection of "portable characters" which may be used by any DCE application, anywhere in the world, regardless of the local character set and encoding in use. (NOTE: This is intended to imply that the ASCII name "ABC" is portable to an EBCDIC system, which further implies that "portable character" data is identifiable in some way so that the necessary conversions can be done. This doesn't necessarily further imply tagged data, since "portable characters" could be implemented as a unique IDL type if desired.)

(d) REQT: Support Standard Locales
    The DCE should support standard locales, if available, from these groups: ISO (highest priority), POSIX, X/OPEN. If none of these groups has created standard locales, OSF should provide its own definitions.

(e) REQT: Support a Universal Character Set and Encoding
    OSF should support a single, universal character set and encoding which may be used by any DCE component or application for the transmission of any desired characters. An example of this is the emerging ISO 10646 standard.

(f) REQT: Support Character Set and Encoding Independence
    The DCE should be able to handle a wide variety of character sets and encoding methods, at a very minimum the character sets and encodings supported by OSF 1.1. DCE components will be required to support interfaces capable of being called with different code sets. It should be possible to compile DCE components into a single object that can support various code sets, both within the workstation and on the network.

(g) REQT: Regional Heterogeneous Interoperability
    The DCE must be able to support networks in which clients and servers are working in different codesets, provided that these different codesets are substantially or entirely interconvertible (i.e., ASCII and U.S.
EBCDIC, or SJIS and UJIS) (NOTE: This will depend on either a tagged data type or a universal codeset, so it would be inconsistent to prioritize it above BOTH of those items)

(h) REQT: Mechanism for Identification of Character Data
    OSF should specify a mechanism for allowing DCE components and applications to identify character data with (at least) the identity of the codeset of the data. This identification could be granular, such as by string, character, node, or filesystem. The definition of granularity should be specified per component.

(i) REQT: Character Data Tools
    Tools should be provided for DCE services to use the identification mechanism. (If tools are not available, vendors can provide them, but the group prefers that DCE provide this.)

(j) REQT: Support EBCDIC Encodings
    EBCDIC encoding should be supported (includes single and multi-byte). In the multi-byte case, User-Defined Characters (GAIJI) should be supported.

(k) REQT: World-Wide Heterogeneous Interoperability
    The DCE must be able to support networks in which clients and servers work in different codesets which cannot be interconverted. (NOTE: This implies some useful "fallback" behavior, not a miracle. It will probably depend on the existence of "portable characters", and so should be prioritized accordingly.)

(l) REQT: Influence Internationalization Standards
    OSF should actively participate in standards groups meetings, seeking to influence internationalization standards in ways favorable to the success of OSF DCE. In particular, these areas are of concern to the DCE: Data tagging, Large character sets, Wide-character processing, Locale naming, Distributed locales.

(m) REQT: Permit User-defined Locales
    In a distributed environment, it must be possible for such locales to be available on both a client and its associated server. This may require manual replication on the part of a system administrator.
    Such user-defined locales must be defined on a network basis.

APPENDIX B. DATA INTERCHANGE

The following is an in-depth presentation of various architectures for data interchange in an internationalized DCE environment. It is extracted (with minor re-wording) from [SKRS].

B.1. Data Interchange in an Internationalized DCE

The Interchange component of the DCE architectural model addresses issues associated with the communication of National Language sensitive data types between the processes comprising a distributed application. More specifically, it addresses what encodings will be used to communicate National Language sensitive information and where in the system conversions are performed when the encodings used locally by the sending process differ from those used locally by the receiving process. Additionally, the interchange architecture specifies how either the sender or the receiver determines that a conversion is necessary at all.

The remainder of this section first introduces a set of useful terminology for discussing interchange in the DCE environment and then examines a progression of three DCE interchange environments ranging from simple to very complex. These are not intended to describe every possible variation of interchange support, but rather to define large interesting categories of support and some of the issues associated with each.

B.2. Terminology

This section introduces a basic set of terminology for discussing the interchange architecture for the OSF DCE. The terms defined here are used throughout the next three sections.

(a) Native DCE implementation
    Refers to the level of functionality provided by the OSF/DCE 1.0 source offering. This includes straight ports (with no functional enhancements) to platforms other than the reference implementations.

(b) Character Set
    A collection of characters.
(c) Coded Character Set
    A set of unambiguous rules that establishes a character set and the one-to-one relationship between each character of the set and its bit representation.

(d) Codeset
    A coded character set that is used to encode all the characters in some locale.

(e) DCE Process
    A process on a given host which is either running a DCE daemon or which is running an application which is using DCE facilities.

(f) DCE Logical Network
    A set of given hosts which are physically connected by some communications media and which are configured and administered to behave as a single, logical DCE system.

(g) DCE Portable Character Set
    The set of semantic characters in DCE R1.0 that are guaranteed to be supported in names within the CDS (Cell Directory Service), the file system, and Security. The set consists of the following 95 characters (note that the space character is included, between the letters and the numerals):

        [a-z][A-Z] [0-9]!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~

    This set defines semantic, rather than visual or encoded characters. That means, for example, that a character with the semantics of <backslash> must be supported, even if in some fonts, the backslash glyph has been replaced with a Yen sign or c-cedilla or other glyph. In addition, it means that just because two glyphs look similar doesn't mean they are the same semantic character. For instance, a double-byte 'A' or a Greek ALPHA appear very similar to the semantic <A>. But these are *not* the same semantic character as the <A>.

    DCE guarantees to support these 95 characters in names, but it does not prohibit you from using additional characters beyond the 95. Handling of additional characters is implementation-specific.
    "Supported" means that it is suggested users exercise discipline in restricting themselves to just these 95 characters (preferably as minimal a subset as they can get away with, such as alphanumerics), but that DCE would not check that they used only these 95 characters. No encoding is defined. Use of IDL-chars and C-(unsigned)-chars is specified. Handling of chars other than these 95 characters is to be unspecified and implementation-specific.

(h) Encoding
    A single coded character set or a defined methodology that allows multiple coded character sets to be combined. The latter case includes well-known rules for determining the single coded character set to which a character belongs. Examples of such rules are ISO 2022 and Compound Text, which include tags or escape sequences to identify the coded character set a character or characters belong to.

(i) Network Interchange Encoding (NIE)
    The encoding used to transfer text strings between DCE processes. The Network Interchange Encoding must be able to encode all characters that may exist in a DCE Logical Network. The set of characters that need to be encoded is defined by the union of all coded character sets by all DCE Processes in a given DCE Logical Network.

B.3. The Homogeneous Codeset Environment

This environment consists of a single Codeset that is supported by all DCE Processes in the DCE Logical Network. It assumes that a single Codeset is used as the Network Interchange Encoding, and that the same Codeset is used locally by all DCE Processes in a DCE Logical Network (see Figure 1 below).

This environment does not specify the specific Codeset to be used. Therefore different DCE Logical Networks could be configured to support different Codesets. The only restriction is that the selected Codeset must be a superset of (contain at least the characters in) the DCE Portable Character Set.
(NOTE: This assumes that (1) the OSF DCE source base is 8-bit clean and (2) that it carries no dependencies on any codeset specific characteristics (e.g., contiguous ranges of characters).)

[Figure 1. Homogeneous Codeset Environment. Two DCE Processes, each containing an APPLICATION layer and an RPC layer, exchange Data across the network. All DCE Processes support THE SAME character set. On each side the local codeset is the interchange encoding (CS=NIE), and the network link is labelled "NIE=local codeset used by all DCE Processes".]

Since all DCE Processes in this scenario recognize a single Codeset, all input and output with these nodes must use only the characters defined in this Codeset. Even if the nodes support other characters outside of this Codeset for environments beyond DCE, those characters would not be supported. This, and the fact that all processes are using the same encodings for the characters they use in common, ensures that there is no loss of data when transmitting between the DCE Processes because no translation needs to be done by either node into a different Codeset. This avoids the performance penalties accompanying the conversions and guarantees the integrity of the exchanged character data.
As a result, a string entered in one DCE Process and sent to a server in another DCE Process will always consist of the same set of characters when retrieved from a third DCE Process, provided the three processes belong to the same DCE Logical Network.

This environment has the important restriction, however, that it relies on the identical Codeset being supported natively by all DCE Processes in the DCE Logical Network. (NOTE: "Natively" means that the Codeset is supported by the underlying host operating system.) This significantly limits its practical application within the heterogeneous vendor environments being specifically targeted by the DCE. An interesting example of one such heterogeneous environment is one which contains both ASCII and EBCDIC hosts.

B.4. The Homogeneous Network Codeset Environment

In the Homogeneous Network Codeset Environment, all DCE processes within a given DCE Logical Network support the same Character Set. Each DCE process may use a different codeset to encode the character set locally, but all DCE processes within the DCE Logical Network use the same Network Interchange Encoding. If the codeset used by a given DCE process differs from the Network Interchange Encoding, then that process is responsible for converting data between its codeset and the Network Interchange Encoding before sending data to, and after receiving data from, the network.
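The conversion responsibility described above can be illustrated with a minimal sketch. It is not taken from the DCE; it assumes, purely for illustration, that UTF-8 plays the role of the Network Interchange Encoding and that Python's "cp037" (an EBCDIC codeset) and "latin-1" codecs stand in for two local codesets encoding the same character set. The function names to_nie and from_nie are hypothetical.

```python
import codecs

# Assumption: UTF-8 stands in for the Network Interchange Encoding (NIE).
NIE = "utf-8"


def same_codeset(a: str, b: str) -> bool:
    """Compare codec names after normalizing aliases (e.g. 'UTF8' and 'utf-8')."""
    return codecs.lookup(a).name == codecs.lookup(b).name


def to_nie(local_bytes: bytes, local_codeset: str) -> bytes:
    """Convert text from the local codeset to the NIE before sending."""
    if same_codeset(local_codeset, NIE):
        return local_bytes  # local codeset is the NIE: nothing to convert
    return local_bytes.decode(local_codeset).encode(NIE)


def from_nie(wire_bytes: bytes, local_codeset: str) -> bytes:
    """Convert text from the NIE to the local codeset after receiving."""
    if same_codeset(local_codeset, NIE):
        return wire_bytes
    return wire_bytes.decode(NIE).encode(local_codeset)


# An EBCDIC-based sender and an ISO 8859-1-based receiver that share one
# character set but encode it differently.
sent = "Grüße".encode("cp037")
wire = to_nie(sent, "cp037")          # conversion 1, on the sending side
received = from_nie(wire, "latin-1")  # conversion 2, on the receiving side
```

In this sketch a single transmission costs two conversions, one on each side, and the byte strings held by sender and receiver differ even though both represent the same characters.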
------------------------------------------------------------------------
[Diagram: two DCE Processes, both supporting THE SAME character set, linked by "NIE=Network Codeset". Each process has its own DATA, stored and processed using the local Codeset, and a CS <-> NIE conversion step between the local Codeset and the network. Note in the figure: each DCE Process which uses a local Codeset different than the NIE must perform conversions between its local Codeset and the NIE.]

Figure 2. The Homogeneous Network Codeset Environment
------------------------------------------------------------------------

This environment represents a superset of the interchange support provided by the Homogeneous Codeset Environment. Each DCE Process may use a different codeset, which enables ASCII-based and EBCDIC-based DCE Processes to coexist and exchange character data within a single DCE Logical Network. At the same time, it maintains data integrity by requiring all DCE Processes within a given DCE Logical Network to support the same set of characters. Therefore, if the DCE components themselves are properly internationalized and do not introduce any restrictions beyond the interchange architecture defined here, a given DCE Logical Network could be configured to support any arbitrary character set which is a superset of the DCE Portable Character Set.
From a performance perspective, this environment can require up to two conversions to be performed on a single transmission between DCE processes if both processes are using a codeset which is different than the Network Interchange Encoding. Note that this is true even in the case where the DCE processes are using the same codeset.

Because all components of the base OSF/DCE assume that all character data passed to them via the RPC mechanism is already encoded in the codeset used by the DCE process on which the implementation is running, this environment cannot be supported by a native DCE implementation. Architecturally, there are two approaches which may be taken to enable the support of this environment. They are distinguished primarily by where the responsibility lies for performing the conversions between the Network Interchange Encoding and the codeset.

The first approach, depicted in Figure 3 below, places the responsibility on the RPC layer of the system. To support this approach, the RPC mechanism must be modified to be sensitive to the codeset and the Network Interchange Encoding and must perform conversions between the two when they are different. (NOTE: "RPC" here lumps the processing performed by the stubs and the runtime together into a single layer. This is done to simplify the discussion and is accurate to the degree that it is not the application programmer who is worrying about the conversions.)

This approach has several advantages which are derived primarily from the fact that it maintains the RPC paradigm for making differences between basic data type representations on communicating systems transparent to the application programmer. In particular, it would not require modifications to the DCE components themselves. (NOTE: This assumes that the DCE components have already been "internationalized".)

The second approach, depicted in Figure 4 below, places the responsibility for conversion on each application.
This approach has the advantage of providing complete flexibility to the application programmer in choosing how and when the conversion is to be performed. However, it comes at the cost of requiring all application programmers to deal with the added complexity of determining what the Network Interchange Encoding is and explicitly performing the appropriate conversions. On the other hand, this approach does not require any changes to the existing RPC mechanism itself.

------------------------------------------------------------------------
[Diagram: two DCE Processes, both supporting THE SAME character set, each containing an Application and an RPC layer; the RPC layers are linked by "NIE=Network Codeset". The CS <-> NIE conversion step is attached to the RPC layer of each process. Each process has its own DATA, stored and processed using the local Codeset. Note in the figure: data is passed to applications in the local Codeset for the DCE Process within which they are running.]

Figure 3. RPC Based Architecture for Homogeneous Network Codeset Support
------------------------------------------------------------------------
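The difference between the two approaches lies only in where the conversion call is made, which the following sketch tries to show. It is an illustration under stated assumptions, not the DCE stub interface: ConvertingStub and RawStub are hypothetical stand-ins for the RPC layer, UTF-8 is assumed as the Network Interchange Encoding, and "cp037" and "latin-1" are assumed local codesets.

```python
import codecs

NIE = "utf-8"  # assumed Network Interchange Encoding


def convert(data: bytes, src: str, dst: str) -> bytes:
    """Re-encode data from codeset src to codeset dst."""
    if codecs.lookup(src).name == codecs.lookup(dst).name:
        return data
    return data.decode(src).encode(dst)


class RawStub:
    """Second approach: the RPC layer passes bytes through untouched,
    so the application must supply and interpret NIE-encoded data."""

    def send(self, data: bytes) -> bytes:
        return data

    def receive(self, wire: bytes) -> bytes:
        return wire


class ConvertingStub(RawStub):
    """First approach: the RPC layer knows the local codeset and converts
    to and from the NIE on behalf of the application."""

    def __init__(self, local_codeset: str) -> None:
        self.local_codeset = local_codeset

    def send(self, data: bytes) -> bytes:
        return convert(data, self.local_codeset, NIE)

    def receive(self, wire: bytes) -> bytes:
        return convert(wire, NIE, self.local_codeset)


local = "cp037"
text = "naïve".encode(local)

# First approach: the application hands over data in its local codeset.
wire_rpc = ConvertingStub(local).send(text)

# Second approach: the application converts explicitly before the call.
wire_app = RawStub().send(convert(text, local, NIE))
```

Both calls put identical bytes on the wire; in the second case every application carries the convert call and the knowledge of what the NIE is.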
Both of these approaches will work and both of them require changes to some portion of the existing DCE system in order to allow the DCE to provide this level of interchange support. However, the first approach is more aligned with the DCE's objectives of providing a platform for developing distributed applications in a heterogeneous environment where the platform isolates the application programmer from as much of the complexity of the environment as possible.

------------------------------------------------------------------------
[Diagram: two DCE Processes, both supporting THE SAME character set, each containing an Application and an RPC layer; the RPC layers are linked by "NIE=Network Codeset". The CS <-> NIE conversion step is attached to the Application of each process. Each process has its own DATA, stored and processed using the local Codeset. Note in the figure: data is passed to applications encoded in the Network Codeset; applications are responsible for converting to the local Codeset for processing.]

Figure 4. Application Based Architecture for Homogeneous Network Codeset Support
------------------------------------------------------------------------

B.5.
The Heterogeneous Environment

In the Heterogeneous Environment, all DCE Processes support the DCE Portable Character Set, but there is no single homogeneous character set which is shared across all DCE Processes in the DCE Logical Network (see Figure 5). Each DCE Process may support its own character set(s), so long as each is a superset of the DCE Portable Character Set, and may use its own codeset, which may be different than the Network Interchange Encoding.

As in the Homogeneous Network Codeset Environment, if the codeset used by a DCE Process is different than the Network Interchange Encoding then that process is responsible for converting its data to the Network Interchange Encoding. The difference is that, since different character sets may be used by different DCE Processes, there is no guarantee that data can be communicated except for the intersection of character sets from the communicating DCE Processes.

This environment is the first that has been examined in this document which introduces potentially significant data integrity problems. Specifically, data integrity cannot be guaranteed between DCE Processes that have incompatible character sets. Only those characters in common at a given time (between requester/sender) are guaranteed to be correct. Therefore, any system interchange architecture which aims to provide this level of support must address the case where a piece of data arrives at a DCE Process for which the DCE Process has no defined conversion to its local representation. The architecture must define which party is responsible for detecting the potential data loss, how and if that loss is communicated to the end user, in what cases the intended operation proceeds in spite of the data loss, etc.

This environment presents several issues relating to how data is passed through the DCE Logical Network and the integrity of the data being communicated.
Specifically, there are three approaches to building a Heterogeneous Environment:

(a) Single Network Interchange Encoding (Canonical form)

(b) Sender makes it right.

(c) Receiver makes it right.

(For both (b) and (c), it is assumed that there is some implementation method for cooperating DCE Processes to communicate the codeset of the information they are exchanging.)

------------------------------------------------------------------------
[Diagram: two DCE Processes, each of which may support a DIFFERENT character set, linked by "NIE = (see text, B.5)". Each process has its own DATA, stored and processed using the local Codeset, and a CS <-> NIE conversion step between the local Codeset and the network. Note in the figure: each DCE Process which uses a codeset that is different than the NIE must perform conversions between its local Codeset and the NIE.]

Figure 5. A Heterogeneous Environment
------------------------------------------------------------------------

B.5.1. The single network interchange encoding

In this environment, there is a single Network Interchange Encoding defined for interchange in the DCE Logical Network. The selection of the Network Interchange Encoding must be defined such that it can encode the union of character sets that may be supported by the DCE Logical Network. The Network Interchange Encoding may be defined as either a tagging mechanism or as a large character set encoding that includes the union of all character sets found in the DCE Logical Network.
Note that if a tagging mechanism is used as the Network Interchange Encoding then the initial state of the encoding could be defined to correspond to a primary character set(s) of the DCE Logical Network. For example, the initial state could be defined to be either the Latin-1 (ISO8859-1) or the JIS (JISX0201) character set. Such a strategy would eliminate the need to convert unless the data being sent is not contained in the initial state. This optimizes the default case yet allows other DCE Processes whose character set(s) are not included in the initial state to exist in the DCE Logical Network.

If the Network Interchange Encoding chosen is based on a large character set (Unicode or ISO10646), this would require conversion on all systems that do not support it as their coded character set. Yet, it would allow DCE Processes that support the large character set as their codeset to migrate to this environment and bypass the conversion on communication.

B.5.2. Receiver makes it right

In this environment, there is no single Network Interchange Encoding defined for the DCE Logical Network, but rather each DCE Process sends data using its own coded character set as its Network Interchange Encoding. The receiving DCE Process will have negotiated to accept the codeset of the sending DCE Process. In the case where a DCE Process receives data that is encoded using a different codeset than it is using locally, it is the receiving DCE Process's responsibility to perform the conversion from the sending DCE Process's codeset to the codeset being used locally.

This implies that each receiving DCE Process needs to provide a converter for each codeset that may exist in the DCE Logical Network. This may lead to N converters being available to each DCE Process, where N is the number of codesets existing in the DCE Logical Network.
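The receiver-side conversion just described can be sketched as follows. This is a hypothetical illustration, not a DCE interface: the Receiver class and its methods are invented for the example, Python codec names stand in for codesets, and the codeset tag is assumed to travel with the data by some unspecified means.

```python
import codecs


def canonical(codeset: str) -> str:
    """Normalize codec aliases so that names can be compared."""
    return codecs.lookup(codeset).name


class Receiver:
    """A receiving process holding one converter per accepted sender codeset."""

    def __init__(self, local_codeset: str, accepted: list[str]) -> None:
        self.local_codeset = local_codeset
        # Up to N entries, one per codeset present in the logical network.
        self.accepted = {canonical(c) for c in accepted}
        self.accepted.add(canonical(local_codeset))

    def can_accept(self, sender_codeset: str) -> bool:
        """The negotiation reduces to: is a converter available or not?"""
        return canonical(sender_codeset) in self.accepted

    def receive(self, data: bytes, sender_codeset: str) -> bytes:
        if not self.can_accept(sender_codeset):
            raise LookupError(f"no converter for {sender_codeset}")
        if canonical(sender_codeset) == canonical(self.local_codeset):
            return data  # same codeset on both sides: no conversion
        # Raises UnicodeEncodeError when a character has no local
        # representation, i.e. the data-loss case must be handled by the caller.
        return data.decode(sender_codeset).encode(self.local_codeset)


receiver = Receiver("latin-1", accepted=["cp037", "shift_jis"])
converted = receiver.receive("Zürich".encode("cp037"), "cp037")

try:
    receiver.receive("日本".encode("shift_jis"), "shift_jis")
    lost = False
except UnicodeEncodeError:
    lost = True  # characters outside the receiver's character set
```

The sketch chooses to fail loudly when a character cannot be represented locally; whether to fail, substitute, or proceed is exactly the policy question the architecture has to settle.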
This approach does eliminate the need to convert if both DCE Processes are using the same codeset. Furthermore, the negotiation procedure is reduced to the receiver either being able to perform the conversion from the sender's Codeset to its Local Codeset or not.

B.5.3. Sender makes it right

This environment is a derivative of the previous "Receiver Makes it Right" with the exception that any conversion is the responsibility of the sender. Additionally, the negotiation process in this model must be more sophisticated as it requires the sender to obtain knowledge of the codeset that the receiver would like to receive the data in before the data is actually sent.

------------------------------------------------------------------------
[Diagram: DCE Process 1 sends to DCE Process 2 using "NIE1=CS of DCE Process1"; DCE Process 2 sends to DCE Process 1 using "NIE2=CS of DCE Process2". Each process has its own DATA.]

Figure 6. Receiver Makes it Right
------------------------------------------------------------------------

B.6. Current RPC Facility

Because of dependencies within the individual DCE components on the ordering of code points defined by the ASCII family of codesets, it is assumed that Native DCE implementations based on OSF/DCE 1.0 can only support the Homogeneous Codeset Environment described in section B.3.

However, the existing RPC facility does provide the capability to support an interesting subset (or special case) of the Heterogeneous Environment (see section B.5) through its definition of "ASCII" and "EBCDIC" tagging. This special case can be defined by applying the following restrictions to the Heterogeneous Environment definition.

(a) The number of Network Interchange Encodings which can be used in a given DCE Logical Network is restricted to precisely 2.
One of these must be "ASCII" based and the other "EBCDIC" based. (Heterogeneous Environment)

(b) Each DCE Process in the DCE Logical Network must use one of the Network Interchange Encodings as its codeset. The character set is the same for both Network Interchange Encodings as well as the codesets used within the DCE Processes. (Homogeneous Network Codeset Environment)

(c) Each DCE Process is responsible for converting data received from a DCE Process using a different codeset than what is being used locally. (This requires a given DCE Process to maintain only one conversion table which provides the mapping from the other Network Interchange Encoding to the one being used locally.)

REFERENCES

[Klin] S. Kline, "OSF I18N SIG Generic Requirements", I18N SIG paper, August 21, 1992.

[Ogur] T. Ogura (with S. Martin), "DCE 1.1 I18N Workbook", Preliminary Draft, June 5, 1992. [This is an OSF internal investigation report, available to DCE licensees only.]

[SKRS] S. Snyder, H. Kushki, F. Rojas, E. Stokes, "Internationalization in the OSF DCE, A Framework", DCE SIG working paper, May 22, 1991.

AUTHOR'S ADDRESS

Arne Thormodsen                 Internet email: arnet@cup.hp.com
CSO Internationalization        Telephone: +1-408-447-4798
Hewlett-Packard Co.
19447 Pruneridge Ave.
Cupertino, CA 95014
USA