OSF DCE SIG                                           Sue Kline (HP)
Request For Comments: 39.0                         Alex McLeod (IBM)
                                                Makoto Nishino (IBM)
                                                David Obermann (IBM)
                                              Francis X. Rojas (IBM)
                                                Arne Thormodsen (HP)
                                                          March 1993

AN INTERNATIONALIZED DCE CHARACTER HANDLING PROPOSAL --
INTERCHANGE OF CODED CHARACTERS CONVENTIONS AND MECHANISMS

1. INTRODUCTION

In today's global computer marketplace, numerous character set encodings are available, designed to support various national and international market sectors. Many of these codesets are based on standards, some have become industry "de facto" standards due to their acceptance within a marketplace, while still others are vendor proprietary.

To be successful in the global marketplace, DCE must provide a framework whereby applications can interoperate in regional and global networks, even though differing codesets may be in use. In other words, DCE should provide the enabling services to properly transport encoded character data from one system to another and the enabling facilities for the receiving system to decode the data into an acceptable representation for local processing.

1.1. Abstract

This paper proposes to:

(a) Enable the interchange of data within a global heterogeneous networking environment.

(b) Enable the interchange of data within a large variety of existing regional, relatively homogeneous, environments.

(c) Define the conventions to be used to interchange coded character data of different encodings.

(d) Provide solutions for the DCE 1.1 timeframe in addition to looking at possible longer term architectural solutions.

(e) Provide solutions that retain backwards compatibility with existing DCE applications and do not modify the current RPC protocol.

Kline, McLeod, Nishino, Obermann, Rojas, Thormodsen            Page 1

DCE-RFC 39.0         DCE I18N Character Handling          March 1993

1.2.
Problems with Existing DCE Support for Character Data

DCE was not originally designed with provisions to support the vast number of character encodings that exist in the global marketplace. The concept of a "char" within IDL includes inherent restrictions to ASCII and (a single codepage of) EBCDIC, thereby precluding support for all other codesets. This existing IDL "char" behavior must be preserved to ensure that existing DCE applications are not broken. However, for DCE to be successful in the global market, it must also be able to support other (non-ASCII/non-EBCDIC) encodings.

1.3. Regional Interoperability Requirements

Regional interoperability requires a solution whereby systems within the network can successfully interchange character data that is in the same language, but may or may not use the same character set or encodings. In this environment, the number of character sets and encodings is small, and they are well-known across the network. In these cases it is possible to define conversions that result in little or no data loss under most circumstances. Examples of such cases are numerous throughout the world, but to name a few specific "must-solve" cases:

(a) Japan -- Shift-JIS and EUC encodings of Kanji characters.

(b) Western Europe -- the ISO 8859-1 standard character set and a wide variety of proprietary ones.

Ideally, a regional interoperability solution will allow for the most efficient transfer of characters if a single codeset is in use throughout a network. The emphasis is on avoiding unnecessary data conversions at either the sending or the receiving end. For example, it should be possible to transfer data between two identical machines using the same local encoding (either standards-based or vendor proprietary) without any data conversion.

When conversions are necessary within a regional network, they should be optimal conversions.
A case in point is the Japanese market, where Shift-JIS and EUC conversions are commonplace within applications today and consequently are highly optimized.

1.4. Global Interoperability Requirements

Global interoperability requirements involve the transfer of data between systems which may have no direct ability to process or convert the character sets and encodings used on other systems. The character sets may be designed to support different languages, and so share only a small subset of common characters, or perhaps no characters at all. In some cases proprietary sets may be in use, and only the local system they are used on can convert from these encodings to other, more widely used, encodings. The best example of these problems is found in a heterogeneous, global network of systems, such as is evident in most multi-national corporations.

Global interoperability entails a more sophisticated solution than the regional case. DCE must be able to handle any character encoding that may be in use, and also identify those cases where interoperability is not possible (for example, when two entirely non-intersecting character sets are in use on different systems).

One single, universal encoding must be provided by DCE for on-the-wire representation of character data. This allows interoperability between systems which otherwise cannot convert directly to each other's encodings. It also allows systems which use encodings not in use on a given network to be integrated into the network anyway, provided that they convert to and from this universal encoding when accessing other systems.

A possible trade-off of optimal performance in favor of functionality may occur in order to fulfill the global interoperability requirements, as conversions to/from this universal data representation would be required in most cases.
Also, as mentioned, there will be cases where interoperability cannot be achieved. These cases must be reliably identified, preferably before client-server communication is attempted.

1.5. Overview of this Proposal

This proposal provides solutions for addressing both the regional and global problem areas by focusing on enhancements within the RPC component and on related RPC-based services. The remainder of the paper is broken into the following Sections:

(a) Terminology used in this paper.

(b) A description of the functionality and implementation requirements.

(c) A description of various proposed codeset conversion models.

(d) A description of proposed extensions to the DCE NSI and some associated support APIs.

(e) A description of a proposed codeset identification scheme and some associated support APIs.

(f) Dependencies and assumptions.

2. TERMINOLOGY

The following terminology will be used throughout the remainder of this proposal:

(a) "character set" -- A collection of symbols used to represent information, typically a written human language.

(b) "codeset" -- A character set with assigned numeric codes.

(c) "codeset_context" -- A set of information including the local codeset of a given client/server and the set of related codeset converters locally available. This is explained in more detail in Section 5.7.

(d) "codeset tag" -- Identifies the codesets involved in the request/reply round trip. The codeset tag consists of two parts, the "transmit_id" and the "response_id". This tag is also referenced as a "codeset_t" type in this paper.

(e) "cs_type" -- The transmit_id and response_id (below) are of type "cs_type".

(f) "local codeset" -- Whatever character type and encoding is in use on a given client/server. The possibilities include single-byte, multibyte and wide character types.
(g) "reply" -- A data transmission issued by the server in answer to a specific client request.

(h) "request" -- A data transmission issued by the client to a server.

(i) "response_id" -- The expected codeset of the next incoming data transmission.

(j) "transmit_id" -- The codeset of the current outgoing data transmission.

3. DESCRIPTION OF REQUIRED FUNCTIONALITY

In order to satisfy the goals outlined in Section 1, several enhancements or additions to the existing DCE are needed. These additions include extensions to the RPC to support codeset identification and conversion, runtime library support for this extended functionality, and name service (NSI) extensions to allow clients to determine the codesets supported by a server.

Below is a high-level description of the requested functionality, followed by a lower-level description of the requirements for implementing this functionality.

3.1. High-level View of New DCE I18N Functionality

One of the major goals of this proposal is to help clarify whether a given feature is expected to be implemented as a feature of the IDL compiler, to be provided as a library function with the DCE product, or to be implemented by actual changes or additions to the logic of an application. This proposal attempts to minimize this last item. Our goal is to permit a distributed application to be internationalized to some useful degree with only changes to the interface definition. At the same time we recognize that some functionality can only be implemented by making changes in an application, for example by changing the logic used by a client to select a server.

Therefore, as a guideline to implementors, each area of functionality is marked to indicate whether it should be implemented by changes in one or more of the following areas:

(a) "APP" -- The actual client and/or server implementations ("manager code").
(b) "STUB" -- Logic flows in the IDL-generated stub code.

(c) "LIB" -- Runtime support library functions, which may be called from application code or stub code.

The areas of new functionality required are:

(a) A mechanism for clients and servers to determine their own local codesets, and a mechanism to make this information visible to the RPC stub code. It is not anticipated that more than two codesets (one for each "end") will need to be supported per client/server connection. (APP, STUB, LIB)

(b) A mechanism, supported via the DCE NSI, for a client to determine if a given server supports particular codesets. See Section 5 for more details. (APP, LIB)

(c) A specific protocol for a client to indicate to a server, and for a server to indicate to a client, what codeset is in use "on-the-wire". (STUB, LIB)

(d) A specific protocol which allows a client optionally to indicate to a server how and where conversions should be handled. In particular, support for at least the conversion models discussed in Section 4 should be provided. (APP, STUB, LIB)

(e) A mechanism to do actual codeset conversions. (LIB)

(f) A mechanism for clients to handle codeset conversion errors which occur either locally or at the server. This support should be integrated into existing RPC error-handling mechanisms. (APP, STUB, LIB)

3.2. Implementation Requirements

The functionality outlined above translates into several specific implementation requirements. These are listed below, in no particular order of priority:

(a) Character data parameters appearing as scalars, fixed-length arrays and null-terminated strings must all be handled properly. Characters of various sizes, for example ISO 8859 8-bit characters, UNICODE 16-bit characters and ISO 10646 32-bit characters, must all be handled properly.
    (NOTE: This paper does not address the issue of different character data sizes being used on clients and servers. This is not an internationalization issue, and handling such conversions is outside of the scope of the basic DCE RPC mechanism.)

(b) Character conversions which result in a change in the number of character elements in a data structure must be handled properly. Note that in certain cases (e.g., character data embedded in a fixed-size structure) generation of an error may be the appropriate action.

(c) The protocol used for clients and servers to indicate their local codeset to each other should not be visible at the level of an RPC call within an application.

(d) The mechanism used for clients and servers to do actual codeset conversions (including data size issues, as in item (b) above) should not be visible at the level of an RPC call within an application.

(e) Client and server applications must have a mechanism to indicate to each other what local codesets are in use, and what encoding is being used for "on-the-wire" data. A specific protocol must be provided for indicating the identity of codesets (see Section 6). OSF should supply a process for the provision of these ids. This process must provide specific identifiers for the codesets in Section 7.3.2 as well as other sets found to be of general interest to the DCE community. It must also allow for private user and vendor-specific extensions.

(f) A single server must be able to communicate with clients which support different conversion models (see Section 4).

(g) The DCE source, as delivered, must define a set of internal interfaces that can be used by implementors to allow access to their own set of interoperability routines (i.e., codeset converters, conversion policy controls, etc.).
    The DCE source must also provide at least one set of interoperability routines which may be used "as-is" by DCE applications, and which support the conversion models described in this proposal (see Section 4). At a minimum, these should support the OSF/1-supported codesets as listed in Section 7.3.2.

    (NOTE: Conversions to/from the above mentioned codesets and a UCS encoding must be provided by DCE independently of the underlying OS. This is because these conversions must be supported in all implementations of DCE 1.1 to assure conformance to the interoperability goals of this proposal.)

(h) A mechanism must be provided by DCE, presumably through the name services, to allow a client to determine the codesets and conversions supported by a server. Furthermore, the client should be able to accept/reject binding to a server based on this information. This facility, which is intended to provide a method of ensuring data integrity for client-server connections ("dynamic model"), is discussed in Sections 4.4 and 5.

(i) A method must be provided to allow servers which support this new internationalized behavior to be backwards compatible with older clients. This will presumably be handled through the provision of multiple interfaces to a given server.

(j) A mechanism must be provided for conversion errors at both client and server to be reported to a client. Such errors may include (but are not limited to): unknown character encoding, invalid character encoding, and buffer overflow during conversions.

    (NOTE: This proposal does not address the error handling issue, as it was felt to be highly implementation dependent.)

4. DESCRIPTION OF PROPOSED CONVERSION MODELS

Three proposed conversion models are discussed in this Section: Universal Character Set (universal), Receiver Makes it Right (RMIR) and Dynamic (dynamic).
Three models were found to be needed in order to accommodate the various efficiency and functionality concerns which arose while forming this proposal. The particular model is determined by the nature of the client. Only one kind of server is described here; it will support all conversion models.

This last point is important to note. The conversion models are determined not by the codeset converter itself, but by the values of the codeset tags (see below). This important feature allows for a great deal of flexibility, and extensibility, within a simple framework.

Conversion models other than the ones discussed below are possible. The models below were selected because they seemed to be the simplest set which could cover all of the design criteria. Since the models are implemented in libraries external to both application and stub code, other models are possible within the framework of this proposal.

Key to all of these models is the codeset tag, which consists of two parts as illustrated in the following diagram:

              +---------------------+--------------------+
codeset tag:  |     transmit_id     |    response_id     |
              +---------------------+--------------------+

The "codeset tag" is a conceptual model for the information which must be passed between a client and a server to implement the models described below. This tag consists of two parts, each of type "cs_type". The "transmit_id" indicates the codeset of the current outgoing transmission (assuming there is one) while the "response_id" indicates the desired (and expected) codeset of the next incoming data transmission. In most cases these values will be the same. However, as discussed below, the response_id may take on the special value "no value", to indicate that the client has no preferred response codeset (i.e., it assumes it can convert anything it receives to its own local codeset).
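As an illustration only, the conceptual tag might be represented in C as follows. The proposal names the types "cs_type" and "codeset_t" but does not define their layout, so the integer width, the `CS_NO_VALUE` sentinel and both helper functions are assumptions made for this sketch, not part of the proposal.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: the proposal does not fix the width of cs_type. */
typedef uint32_t cs_type;

/* Assumption: a reserved identifier standing for the special "no value". */
#define CS_NO_VALUE ((cs_type)0)

/* The two-part codeset tag: what is being sent, and what is wanted back. */
typedef struct {
    cs_type transmit_id;   /* codeset of the current outgoing transmission */
    cs_type response_id;   /* desired codeset of the next incoming one     */
} codeset_t;

/* Hypothetical helper: build a tag from its two parts. */
codeset_t codeset_tag_make(cs_type transmit_id, cs_type response_id)
{
    codeset_t tag;
    tag.transmit_id = transmit_id;
    tag.response_id = response_id;
    return tag;
}

/* Hypothetical helper: true when the sender asked for a specific codeset. */
bool codeset_tag_has_preference(const codeset_t *tag)
{
    return tag->response_id != CS_NO_VALUE;
}
```

A sender with no preferred response codeset would build its tag as `codeset_tag_make(local_id, CS_NO_VALUE)`, where `local_id` stands for whatever identifier its local codeset has been assigned.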
It is the intent of this proposal that this tag, in whatever form it takes in an actual implementation, not be visible at the level of an RPC call. It should only appear in stub-stub communications, and should only be specified in ".idl" (and/or ".acf") files.

To illustrate, assume that a client is sending requests in codeset "X", and requesting replies from the server in codeset "Y" (not a highly realistic situation). The codeset tag for outgoing, client-generated request data would then be:

     client   +---------------------+--------------------+
codeset tag:  |          X          |         Y          |
              +---------------------+--------------------+

Correspondingly, the server in this case would be transmitting the reply back to the client in codeset "Y", and requesting data from the client in codeset "X". The codeset tag for all outgoing, server-generated replies to the client would be:

     server   +---------------------+--------------------+
codeset tag:  |          Y          |         X          |
              +---------------------+--------------------+

(NOTE: The "response_id" may be set to "no value", indicating that the transmitting side does not have a codeset preference for data it will receive. This is used by the "RMIR" conversion model, see Section 4.3.)

This model is conceptual only; for efficiency, any actual implementation should minimize, where possible, the amount of tag data exchanged between clients and servers. As an example, for an operation with only "[in]" parameters, there is clearly no need for the server to send back any tag information, since in this case no character data moves from the server to the client. Similarly, there is no need, in most cases, for the server to send back a "response_id" unless it is known that more data will be sent from the client, and that it is possible for the client to convert to the server's preference. These and similar issues are not discussed below, for the sake of simplicity.
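The relationship between the request tag and the reply tag in the "X"/"Y" illustration can be sketched in C. This is a minimal sketch under stated assumptions: the `cs_type` width, the `CS_NO_VALUE` sentinel and the function `codeset_tag_for_reply()` are hypothetical, and the "no value" branch follows the rule that a sender without a preference receives data in the server's local codeset.

```c
#include <stdint.h>

typedef uint32_t cs_type;            /* assumption: width is not specified */
#define CS_NO_VALUE ((cs_type)0)     /* assumption: "no value" sentinel    */

typedef struct {
    cs_type transmit_id;
    cs_type response_id;
} codeset_t;

/*
 * Hypothetical helper: derive the tag of a server reply from the tag of
 * the client request it answers.
 *
 * - If the client asked for a specific codeset, the reply is sent in that
 *   codeset and the server asks for further data in the codeset the
 *   client used (the X/Y tag becomes the Y/X tag).
 * - If the client expressed no preference, the reply is sent in the
 *   server's local codeset and the server expresses no preference either.
 */
codeset_t codeset_tag_for_reply(codeset_t request, cs_type server_local)
{
    codeset_t reply;

    if (request.response_id == CS_NO_VALUE) {
        reply.transmit_id = server_local;
        reply.response_id = CS_NO_VALUE;
    } else {
        reply.transmit_id = request.response_id;
        reply.response_id = request.transmit_id;
    }
    return reply;
}
```

Keeping this derivation in one library routine, rather than in each stub, is one way to keep the tag invisible at the level of an RPC call.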
Also, note that in all cases the codeset "conversions" described may be "no-op" conversions, if the on-the-wire encoding is the same as the local codeset. This situation is not described as a special case in any of the models.

4.1. Determination of Conversion Models

There are three conversion models discussed below. A natural question which may arise is how the client determines which model to use. In the case of the "dynamic" model (Section 4.4) the actual interaction is determined at runtime. However, for the other two models there are several possibilities:

(a) The conversion model may be statically determined at "IDL" compile time by specifying attributes.

(b) The conversion model may be determined at runtime by some information external to a client, such as a configuration file or environment variable.

(c) The conversion model may be indicated by the client via an API which sets some global value which is in turn used by the stub code to determine the model.

Each of these approaches has advantages and disadvantages. We were unable to come up with a clear answer as to which might be the best. All three approaches would work, and all three can be made extensible to new conversion models (although in case (a) this will require a well-conceived design on the part of the IDL developers). This issue needs to be resolved during the design and implementation process.

4.2. Universal Character Set Model -- 'universal'

The Universal Character Set (UCS) model is the simplest model to understand and implement, but necessarily not the most efficient. In this case, local codesets are always converted to a UCS encoding before transmission. Despite the possible performance impacts, this model is essential; it provides the only way to integrate a client which supports a particular codeset with a server which does not. A high-level description of this model is below.
(a) CLIENT:

    (i) Any character data sent to the server is converted from the local codeset to the UCS encoding prior to transmission.

    (ii) Both id fields in the codeset tag are set to the "UCS" id.

    (iii) Any character data returned from the server is converted from the codeset identified (which will be UCS) to the local codeset. This conversion (UCS to local) is required to be available to the client.

    (iv) Any errors encountered while converting are returned to the client.

(b) SERVER:

    (i) Converts any client character data from the codeset specified by the client's "transmit_id" (which will be UCS) to the local codeset. As with the client, this conversion to the local codeset must be guaranteed.

    (ii) Any character data returned to the client is converted from the local codeset to the UCS encoding. (As with the client, both id fields in the codeset tag will be set to the UCS id.)

    (iii) Any errors encountered while converting are returned to the client.

4.3. Receiver Makes it Right Model -- 'RMIR'

The Receiver Makes it Right (RMIR) model is closely analogous to the existing conversion mechanism which the DCE RPC uses for integer and floating point data types. It is expected to be the predominant mechanism in a well-characterized network of similar-powered machines supporting a limited number of codesets. This situation is encountered, for example, in the US, Western Europe and Japan with well-characterized networks of PCs and workstations. The RMIR model performs the minimum number of conversions, and distributes these evenly across all communicating machines.

(a) CLIENT:

    (i) Any character data sent to the server is sent in the local codeset, and identified in the client's "transmit_id".
        Any character data returned from the server is converted from the codeset indicated by the server's "transmit_id" to the local codeset (this conversion may fail if the appropriate converter is not available).

    (ii) Any errors encountered while converting are returned to the client.

(b) SERVER:

    (i) Attempts to convert from the codeset specified by the client's "transmit_id" to the local codeset (this conversion may fail if the appropriate converter is not available).

    (ii) Any reply data back to the client is sent in the local codeset with the "transmit_id" of the reply indicating the local codeset identifier.

    (iii) Any errors encountered while converting are returned to the client.

(NOTE: The "response_id" of both client and server codeset tags is always "no value" for the RMIR model. Also, the converters at either end are guaranteed at least to be able to convert the UCS encoding to the local codeset, allowing UCS clients and/or servers always to interoperate under this model.)

4.4. Dynamic Model -- 'dynamic'

The dynamic model attempts to perform conversions in the most efficient manner for all cases. It is especially suited to asymmetric cases, such as many clients accessing a single server. In this particular case, this model will ideally result in clients doing all necessary conversions, thus offering the highest server performance.

The dynamic model depends on the client having access to the server's local codeset and conversion capabilities as part of the binding process. A determination can then be made as to which conversion policy should be utilized at RPC runtime (see Section 5). Because of this feature, it is impossible to describe a definite data flow between client and server, as was possible for the two models above. This model is discussed in detail in the following Subsections.

4.4.1.
Rationale for dynamic model

The dynamic model's prime purpose is to determine the best codeset for a specific binding. All operations within an interface are expected to share the results of this determination; hence, the determination need only occur once per binding. In order to provide enough information so that the data conversion can be fully optimized, the following information is needed:

(a) Codeset information to identify both the client's and the server's respective local codeset.

(b) Conversion capability information to identify what type of optimizations can be done respectively by the client and server.

The dynamic model clearly must arrive at no conversions in a homogeneous network. At the same time, the goal is to optimize the conversions in heterogeneous environments for all the situations described in the table which follows. In this table 'X' and 'Y' refer to two different codesets, and 'knows' implies the ability to convert to the local codeset.

+---+------+------+------+------+---------------------------+
|   |client|client|server|server| note                      |
|   |using |knows |using |knows |                           |
+---+------+------+------+------+---------------------------+
| 1 |  X   |  X   |  X   |  X   | using same one.           |
| 2 |  X   | X,Y  |  Y   | X,Y  | both have X<->Y converter |
| 3 |  X   | X,Y  |  Y   |  Y   | only client has X<->Y     |
| 4 |  X   |  X   |  Y   | X,Y  | only server has X<->Y     |
| 5 |  X   |  X   |  Y   |  Y   | neither has X<->Y         |
+---+------+------+------+------+---------------------------+

Given all of these possible situations, the dynamic model needs to arrive at the optimal conversion policy for each case.
There are several possibilities for conversion policy:

+-----------------+  +--------------------+
|  Configuration  |  |   Dynamic Model    |
+---+------+------+  +--------------------+
|   |client|server|  |  Possible Models   |
|   |knows |knows |  |  A   |  B   |  C   |
+---+------+------+  +------+------+------+
| 1 |  X   |  X   |  | HOMO | HOMO | HOMO |
| 2 | X,Y  | X,Y  |  | SMIR | RMIR | CMIR |
| 3 | X,Y  |  Y   |  | CMIR | CMIR | CMIR |
| 4 |  X   | X,Y  |  | SMIR | SMIR | SMIR |
| 5 |  X   |  Y   |  | UCS  | UCS  | UCS  |
+---+------+------+  +------+------+------+

Key:

    HOMO := Homogeneous network, no conversion
    RMIR := Receiver makes it right conversion
    SMIR := Server makes it right conversion
    CMIR := Client makes it right conversion
    UCS  := Universal Character Set conversion

After careful analysis, model 'C' was found to be the best choice. While the analysis is not included in this discussion, it is safe to state that the main reason for choosing 'C' is that it offers (on average) the best server performance by offloading conversions to the client whenever possible. This issue is of importance in configuration "2" above, where multiple possible conversion models exist.

5. MECHANISM FOR EXCHANGE OF CODESET AND CONVERTER INFORMATION

For the dynamic model to work optimally, the client must know the codeset and the conversion capabilities of the server (i.e., the server's codeset_context). While there are various ways to do this, this proposal suggests that extensions be provided to NSI to allow the server to announce its local codeset and set of related codeset converters. Once the binding between a given client and server occurs, the conversion policy information, as determined as part of the binding process, should be made available to the RPC runtime via the server's binding handle.
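The choice of model 'C' across the five configurations can be expressed as a short decision routine. The following C sketch is illustrative only: the structure layout, the enumeration and both function names are assumptions, chosen to mirror the "uses"/"knows" columns of the tables above rather than any defined DCE interface.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t cs_type;   /* assumption: width is not specified */

/* Conversion policies, named after the key of the table above. */
typedef enum { POLICY_HOMO, POLICY_CMIR, POLICY_SMIR, POLICY_UCS } conv_policy;

/* Assumed layout: the local codeset plus the codesets it can convert. */
typedef struct {
    cs_type        local;          /* codeset in use ("using" column)    */
    const cs_type *converters;     /* other codesets known ("knows")     */
    size_t         n_converters;   /* number of entries in converters[]  */
} codeset_context;

/* True if ctx can convert between cs and its own local codeset. */
bool context_knows(const codeset_context *ctx, cs_type cs)
{
    size_t i;

    if (cs == ctx->local)
        return true;
    for (i = 0; i < ctx->n_converters; i++)
        if (ctx->converters[i] == cs)
            return true;
    return false;
}

/* Model 'C': prefer client-side conversion whenever the client can do it. */
conv_policy resolve_policy_model_c(const codeset_context *client,
                                   const codeset_context *server)
{
    if (client->local == server->local)
        return POLICY_HOMO;                 /* configuration 1       */
    if (context_knows(client, server->local))
        return POLICY_CMIR;                 /* configurations 2, 3   */
    if (context_knows(server, client->local))
        return POLICY_SMIR;                 /* configuration 4       */
    return POLICY_UCS;                      /* configuration 5       */
}
```

Testing the client's capability before the server's is what distinguishes model 'C' from models 'A' and 'B' in configuration 2; the other four configurations give the same answer under any ordering.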
For each binding handle being used by a client for an RPC, the client queries the NSI to obtain the server's codeset_context. The client then makes decisions using this information in conjunction with its own local codeset_context to determine the on-the-wire codeset for both directions. The client must append the derived conversion policy, as specified via a transmit_id and response_id (codeset_t), to the server binding handle. This allows the RPC runtime to be informed of the conversion policy decision made at binding time.

(NOTE: In this proposal the binding handle is used to store the conversion policy information because it is available, persistent, and unique per RPC binding. Other implementations are possible, but are not discussed here.)

The next few Sections outline a set of extensions to the DCE NSI, along with some associated APIs, which could be used to provide such support. For the sake of clarity these Sections discuss specific APIs. It is expected that the actual implementation may depart from these specific APIs; however, the information discussed below must somehow be available to a client.

5.1. Enhancements to NSI to allow the optional exportation of server codeset and available codeset converter information to NSI

Enhancements to NSI are requested to facilitate the exportation of the server's codeset and set of available codeset converters. This information will be used to support the "dynamic" model only. It is useful for those applications which wish to ensure successful character data transfer as part of the client/server binding process. In some cases, only the client will be able to convert its codeset to the server's codeset; other cases exist where only the server can convert.
To enable a decision as to the optimal conversion policy, the following enhancements must be provided:

(a) At the time when the server is advertised to NSI, the server must be able to export its codeset_context to NSI. (See Section 5.2.)

(b) NSI must be enhanced to accommodate storing the exported server information regarding the server's codeset and related set of codeset converters (together forming the "codeset_context"). (See Section 5.7.)

(c) When the client is engaged in selecting a suitable server to bind with, the client must be able to query NSI to obtain the server's exported codeset_context. (See Section 5.3.)

(d) If the server's codeset_context information is available, a determination is made by the client as to whether the client and server can bind, based on their respective local codesets and set of supported converters, and from that decision, what conversion policy is appropriate. (See Section 5.4.)

(e) If an acceptable server is found, the server's binding handle must be updated to reflect the derived conversion policy, which will be used in the subsequent (specified) RPC calls. (See Section 5.5.)

To implement these functions, the following set of APIs is proposed.

(NOTE: The following API names are only suggestions. Actual names and the type/number of parameters depend on the implementation of NSI enhancements.)

5.2. void rpc_ns_extensions_codeset_add()

This API allows the server to export its codeset and the list of codeset converters to NSI. The API will utilize the server's codeset_context obtained from a call to "rpc_local_inq_codeset()" (see Section 6.3). Specifically, the following information will be exported to NSI:

(a) Value of the server's codeset, identified by a (cs_type) codeset value.

(b) Number of related converters supported by the server.

(c) Each converter, identified by a (cs_type) codeset value.

5.3.
void rpc_ns_extensions_codeset_inq() This API allows the client to query NSI to obtain the server's codeset and related set of codeset converters. This API will retrieve the codeset_context exported to NSI by the server via the above mentioned "rpc_ns_extensions_codeset_add()" call. This information includes: (a) Value of the server's codeset, identified by a (cstype) codeset value. (b) Number of related converters supported by the server. (c) Each converter, identified by a (cstype) codeset value. This API will allocate client-side storage at runtime to store the server-side information obtained and return a pointer to this storage to the caller. 5.4. void rpc_local_resolve_encoding() This API allows the client to resolve if a binding with a given server can occur based on the codeset_context of the client and server, respectively. This API will utilize the client's codeset_context obtained via a call to "rpc_local_inq_codeset()" and server's codeset_context that was obtained by the previously described RPC call to NSI "rpc_ns_extensions_codeset_inq()". The function of this API is to resolve whether the client or the server is capable of carrying out any required codeset conversions and establish the conversion policy for subsequent RPC operations between this client/server. Kline, McLeod, Nishino, Obermann, Rojas, Thormodsen Page 15 DCE-RFC 39.0 DCE I18N Character Handling March 1993 This routine accepts a "resolution_level" parameter, which indicates the granularity of resolution required by the client. It will return a status (success or failure) and appropriate codeset tags settings in the case of success through pointers. If the codeset_context of the client and server match, then this routine will always return a success status. Three resolution levels are suggested for an application to indicate the desired action when the codeset_contexts do not match. (NOTE: The numeric level values presented are purely arbitrary and are used for proposal purposes only.) 
(a) Level 0 -- Indicates that if the client and the server are not using the same codeset then a status should be returned indicating that no binding is possible. A no binding status may also be returned if (unspecified) error conditions exist.

Level 0 is intended for those applications which require an absolute guarantee of data integrity and would rather not bind with a server if the server is not using the same codeset as the client.

(b) Level 1 -- Indicates that if neither client nor server can directly convert to the other's codeset then a status should be returned indicating that no binding is possible for the codeset_context provided. A no binding status may also be returned if (unspecified) error conditions exist.

Level 1 is intended for those applications which want an assurance that either the client or the server has the necessary converters to convert the data to a local codeset. This assurance might be needed in cases where the overhead of a UCS conversion (see Level 2 below) is not acceptable. It might also be used as part of an optimized search for a server, where binding is sought first at a low resolution level, then at higher levels if lower ones fail.

As data loss may occur during conversion between codesets, this level offers less guarantee of data integrity than Level 0. This data loss should be documented as part of the specification of the converter(s).

(NOTE: Usage of this level indicates a willingness by the application to accept possible data loss in cases where the local character sets of the client/server do not perfectly match. As mentioned, this data loss should be well-characterized as part of the specification of the converters.)

(c) Level 2 -- Indicates that if neither client nor server can directly convert to the other's codeset, the client desires ISO-10646 UCS-2 network encoding as the conversion policy. A status indicating that binding is possible will always be returned when this resolution level is used unless (unspecified) error conditions exist.

Use of Level 2 indicates that the client is more concerned with connectivity than with conversion efficiency or with data integrity. In cases where there are major character set mismatches (say between Arabic and Japanese) then only the "portable character set" as defined by OSF and DCE may be exchanged without data loss. In cases where there are only "minor" character set mismatches, this will behave the same as the Level 1 resolution level (i.e., possible data loss) but with potential dual conversion to/from ISO 10646.

(NOTE: Usage of this level indicates a willingness by the application to accept possible data loss in cases where the local character sets of the client/server do not perfectly match. Unlike Level 1, this data loss cannot usually be characterized in advance as it depends on the interaction between two converters, one at each end of a data transmission.)

In general, based on the logic defined, if a conversion policy can be successfully established, the codeset_t parameter is set to indicate the derived conversion policy. This codeset_t parameter will also be attached to the server binding handle by a subsequent call to rpc_binding_set_codeset_info(), described in Section 5.5.

5.5. void rpc_binding_set_codeset_info()

This API allows the client to set the derived conversion policy for a server binding handle. This will enable RPC calls specified with a "dynamic" model within the ACF "codeset_convert" attribute to access the per binding handle information of the derived conversion policy. This policy was determined via a preceding call to rpc_local_resolve_encoding(), and is specified by the codeset_t parameter. This API attaches the codeset_t parameter to the server's binding handle.
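The resolution logic of Section 5.4, whose result this API stores in the binding handle, can be illustrated with a minimal C sketch. Everything named here is an assumption made for illustration only: the layout of cs_type, codeset_context and codeset_t, the placeholder values CS_NONE and CS_UCS2, the function name sketch_resolve_encoding(), and in particular the choice to set both tags to a single wire codeset. The proposal does not fix any of these.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical representations; the proposal does not define them. */
typedef unsigned long cs_type;
#define CS_NONE ((cs_type)0)        /* placeholder "no codeset" value   */
#define CS_UCS2 ((cs_type)0x10646)  /* placeholder id for ISO 10646 UCS-2 */

typedef struct {
    cs_type        local;       /* local codeset                         */
    size_t         count;       /* number of converters                  */
    const cs_type *converters;  /* codesets convertible to/from local    */
} codeset_context;

typedef struct {
    cs_type transmit_id;        /* wire codeset, client to server        */
    cs_type response_id;        /* wire codeset, server to client        */
} codeset_t;

/* Nonzero if ctx lists a converter between its local codeset and other. */
static int can_convert(const codeset_context *ctx, cs_type other)
{
    size_t i;
    for (i = 0; i < ctx->count; i++)
        if (ctx->converters[i] == other)
            return 1;
    return 0;
}

/* Returns 0 and fills *policy if binding is possible, -1 otherwise. */
int sketch_resolve_encoding(const codeset_context *client,
                            const codeset_context *server,
                            int level, codeset_t *policy)
{
    cs_type wire;

    assert(client != NULL && server != NULL && policy != NULL);
    if (level < 0 || level > 2)
        return -1;                          /* unspecified error case    */

    if (client->local == server->local)
        wire = client->local;               /* match: always succeeds    */
    else if (level == 0)
        return -1;                          /* Level 0: identical only   */
    else if (can_convert(client, server->local))
        wire = server->local;               /* client converts           */
    else if (can_convert(server, client->local))
        wire = client->local;               /* server converts           */
    else if (level == 2)
        wire = CS_UCS2;                     /* Level 2: fall back to UCS */
    else
        return -1;                          /* Level 1: no direct path   */

    policy->transmit_id = wire;
    policy->response_id = wire;
    return 0;
}
```

A client searching for a server could call such a routine first at Level 0 and retry at Level 1 or Level 2 only if the stricter level fails, which matches the optimized search described for Level 1.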
It is important to note that this information must be stored in the selected binding handle as a client may choose to bind with a number of servers, and subsequently have a mixture of conversion policies established with these varied servers.

A companion inquiry API, rpc_binding_inq_codeset_info(), could also be provided, although the usefulness of this API is not apparent.

5.6. void rpc_free_codeset_ptr()

This API is needed to free the allocated local storage created by the call to rpc_ns_extensions_codeset_inq().

5.7. Internals of Proposed NSI Extensions

The NSI database must be able to store the server's exported codeset_context, and possibly other internationalization-related information that may be queried by the client. The following describes the possible format of such a structure:

       +-------+----------------------------------+
    1) |  ##   | version number of this structure |
       +-------+----------------------------------+
    2) |       | reserved ( 32 bits )             |
       +-------+----------------------------------+
    3) |   Y   | the local server codeset         |
       +-------+----------------------------------+
    4) |       | List of Conversions to/from Y    |
       |  ###  | number of codeset identifiers    |
       |  X1   | codeset identifier               | } optional
       |  X2   | codeset identifier               | } optional
       |  XN   | codeset identifier               | } optional
       +-------+----------------------------------+

Field 1 designates the version number of this structure, for backwards compatibility.

Field 2 is reserved for future use.

Field 3 designates the local code set of the server and is a codeset identifier type (cs_type).

Field 4 is a list of codesets for which the server can support two way (round-trip) conversion with its own codeset as designated by Field 3. This field consists of a count followed by the set of codeset identifiers. It is possible that the list is empty, and thereby the count would be set to zero.
Each entry in the list is a codeset identifier type (cs_type). No assumptions may be made about the conversions' ability to preserve invertibility (i.e., Xn->Y->Xn is not guaranteed to preserve all characters). The conversions are, however, guaranteed not to fail due to an inability to convert particular characters.

6. A DCE SUPPLIED CODESET ALIASING MECHANISM

On stand-alone machines today, there is no standard method for identifying codesets or codeset converters. Hence, each system vendor has chosen its own designations. This has led to a multitude of names being used for a single codeset. For example, ISO-88591 could be referenced by all of the following valid names:

ISO-88591, ISO88591, ISO-LATIN1, ISO8859-1, 88591, LATIN-1, iso88591, iso-88591, iso-latin1, iso8859-1, 8859-1, latin-1, etc...

To be useful in a distributed environment, these names must all map to one value. Hence, it is recommended that a DCE-supported codeset aliasing mechanism be developed, whereby all these local names could be mapped to one codeset value supported, and understood, by DCE. It must be possible to transfer this codeset value over the network, for example in the IDL codeset_t tag field elements specified earlier.

One proposal is that this mechanism could be a table which consists of a DCE-supported codeset value and a codeset string field. The DCE codeset field should be the same type (cs_type) as the two codeset_t fields, transmit_id and response_id. DCE licensees must be permitted to modify the codeset string field to match their own string namings. In addition, it might also be beneficial to allow a "comment" field for each entry to state the proper name of this codeset.

It is recommended that all OSF supported codesets be specified in this mechanism as part of the initial DCE interoperability offering.
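The table described above could be sketched in C as follows. This is only an illustration under stated assumptions: the struct layout, the numeric codeset values, the CS_NONE placeholder, and the function names sketch_lookup_id() and sketch_lookup_string() are all hypothetical and are not taken from any OSF registry or DCE header. The sketch also assumes that several local names may share one codeset value and that a reverse lookup returns the first matching name.

```c
#include <stddef.h>
#include <string.h>

typedef unsigned long cs_type;     /* hypothetical representation        */
#define CS_NONE ((cs_type)0)       /* placeholder "no codeset" value     */

struct alias_entry {
    cs_type     id;                /* DCE-supported codeset value        */
    const char *local_name;        /* vendor-modifiable codeset string   */
    const char *comment;           /* proper name of the codeset         */
};

/* Illustrative entries only; the values are not registry assignments.  */
static const struct alias_entry alias_table[] = {
    { 1, "ISO8859-1",  "ISO 8859-1 (Latin-1)" },
    { 1, "ISO-LATIN1", "ISO 8859-1 (Latin-1)" },
    { 1, "iso88591",   "ISO 8859-1 (Latin-1)" },
    { 2, "SJIS",       "Shift-JIS"            },
};

#define ALIAS_COUNT (sizeof alias_table / sizeof alias_table[0])

/* Map a local codeset string to its codeset value, or CS_NONE. */
cs_type sketch_lookup_id(const char *name)
{
    size_t i;
    for (i = 0; i < ALIAS_COUNT; i++)
        if (strcmp(alias_table[i].local_name, name) == 0)
            return alias_table[i].id;
    return CS_NONE;
}

/* Map a codeset value to the first local string listed for it, or NULL. */
const char *sketch_lookup_string(cs_type id)
{
    size_t i;
    for (i = 0; i < ALIAS_COUNT; i++)
        if (alias_table[i].id == id)
            return alias_table[i].local_name;
    return NULL;
}
```

With such a table, two systems that name the same codeset differently still exchange the same cs_type value on the wire; only the string column differs from vendor to vendor.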
It is also expected that OSF will manage this list of codesets and encourage submissions of vendor-specific codesets.

This proposal also mandates that a value be reserved to indicate no (cs_type) codeset value. This is used by the codeset tag to indicate that no conversions are specified by a selected conversion policy, such as the RMIR model when transmitting outgoing data.

For this mechanism to be useful at runtime, several APIs must be provided.

6.1. cs_type rpc_codeset_lookup_id()

This API performs the lookup of a codeset value, when supplied with the local system's codeset string value. In the error case where the codeset_string has no equivalent codeset value, the "no codeset" value must be returned.

6.2. char * rpc_codeset_lookup_string()

This API performs the lookup of a local codeset string value when supplied with the codeset value. In the error case where the codeset value has no string value associated with it, a "NULL" codeset_string value should be returned.

6.3. void rpc_local_set_codeset()

This API allows an application to set the value of the local codeset retrieved by "rpc_local_inq_codeset()".

6.4. cs_type * rpc_local_inq_codeset()

This API performs the lookup of a local codeset either as set by "rpc_local_set_codeset()" or through some default mechanism if "rpc_local_set_codeset()" has not been called.

7. DEPENDENCIES AND ASSUMPTIONS

This Section identifies the set of dependencies required by this proposal.

7.1. Encoding of the 'universal' Model

The proposed codeset to be used with the "universal" model is ISO-10646 UCS-2. At a minimum, conversion to and from the characters specified in Level 2 of this standard shall be provided.

7.2. Local Codeset Query

The XPG specified nl_langinfo(CODESET) function will be used on the OSF/1 implementation as the default means to determine the local codeset.

7.3. Conversion API -- libiconv.a

It is expected that any required character conversion utilities will use the iconv API to invoke the needed conversion. The from and to codeset names are expected to be determined by the codeset tag field supplied to the RPC.

XPG4 defines a set of functions that can be used for conversion. The functions are defined as:

(a) Initialize a conversion descriptor -- iconv_t iconv_open().

(b) Invoke conversion on an input string -- size_t iconv().

(c) Free conversion descriptor -- int iconv_close().

It is expected that libiconv.a and any DCE conversions will be made available to all DCE licensees. If it is acceptable to take the OSF/1 implementation as a dependency, this will be done. Otherwise, the iconv library and conversions should be provided within the DCE deliverable via some mechanism.

7.3.1. Iconv converters available on OSF/1

    SJIS      <-> AJEC
    ISO8859-1 <-> PC Code (IBM-850)
    ISO8859-1 <-> EBCDIC Code (IBM-500)

Vendors are expected to supply their own iconv conversion modules for any proprietary codesets.

7.3.2. UCS conversions needed

    From                  To
    ==================    ==================
    ISO8859-1             ISO10646.1993-2
    ISO8859-7             ISO10646.1993-2
    ISO8859-9             ISO10646.1993-2
    SJIS                  ISO10646.1993-2
    AJEC                  ISO10646.1993-2
    eucKR                 ISO10646.1993-2
    eucTW                 ISO10646.1993-2
    ------------------    ------------------
    ISO10646.1993-2       ISO8859-1
    ISO10646.1993-2       ISO8859-7
    ISO10646.1993-2       ISO8859-9
    ISO10646.1993-2       SJIS
    ISO10646.1993-2       AJEC
    ISO10646.1993-2       eucKR
    ISO10646.1993-2       eucTW
    ------------------    ------------------
    ISO8859-1             ISO10646.1993-UTF2
    ISO8859-7             ISO10646.1993-UTF2
    ISO8859-9             ISO10646.1993-UTF2
    SJIS                  ISO10646.1993-UTF2
    AJEC                  ISO10646.1993-UTF2
    eucKR                 ISO10646.1993-UTF2
    eucTW                 ISO10646.1993-UTF2
    ------------------    ------------------
    ISO10646.1993-UTF2    ISO8859-1
    ISO10646.1993-UTF2    ISO8859-7
    ISO10646.1993-UTF2    ISO8859-9
    ISO10646.1993-UTF2    SJIS
    ISO10646.1993-UTF2    AJEC
    ISO10646.1993-UTF2    eucKR
    ISO10646.1993-UTF2    eucTW
    ==================    ==================

7.4. Testing

All interoperability testing will be done using OSF/1 systems running different codesets but within a common character set, e.g., JIS characters in SJIS and EUC (UJIS) encodings.

8. ACKNOWLEDGEMENTS

Many people's ideas are incorporated here; we cannot hope to cite all of them. However, we would in particular like to acknowledge the inputs from Dick Mackey (OSF), Sandra Martin (OSF), Nat Mishkin (formerly at HP, now at Atria) and Tony Hinxman (DEC). Our apologies to anyone omitted.

AUTHORS' ADDRESSES

Sue Kline
Hewlett-Packard
300 Apollo Drive
Chelmsford, MA 01824
USA
Internet email: kline_s@apollo.hp.com
Telephone: +1-508-436-4960

Alex McLeod
International Business Machines
MC 9652
11400 Burnet Rd
Austin, TX 78758
USA
Internet email: mcleod@nlsarch.austin.ibm.com
Telephone: +1-512-838-8183

Makoto Nishino
International Business Machines
(Please write c/o Francis X. Rojas)
Internet email: nishino@ymtl01.yamato.ibm.co.jp

David Obermann
International Business Machines
ZIP 9340
11400 Burnet Road
Austin, TX 78758
Internet email: obie@ausvm1.vnet.ibm.com
Telephone: +1-512-838-0099

Francis X. Rojas
International Business Machines
MC 9652
11400 Burnet Rd
Austin, TX 78758
USA
Internet email: fxrojas@nlsarch.austin.ibm.com
Telephone: +1-512-838-8183

Arne Thormodsen
Hewlett-Packard
19447 Pruneridge Ave
Cupertino, CA 95014
USA
Internet email: arnet@cup.hp.com
Telephone: +1-408-447-4798