
Protocols for Interworking: XNFS, Version 3W
Copyright © 1998 The Open Group

XNFS: Protocol Specification, Version 2

This chapter specifies a protocol that Sun Microsystems, Inc. and others are using. It is derived from a document designated RFC 1094 by the ARPA Network Information Center (see References to RFCs).

Introduction

The Network File System (NFS) protocol provides transparent remote access to shared file systems over local area networks. The NFS protocol is designed to be machine, operating system, network architecture and transport protocol-independent. This independence is achieved through the use of Remote Procedure Call (RPC) primitives built on top of an External Data Representation (XDR). Implementations exist for a variety of machines, from personal computers to supercomputers.

The supporting mount protocol (see Mount Protocol) allows the server to hand out remote access privileges to a restricted set of clients. It performs the operating system-specific functions that allow a client to attach remote directory trees to a local file system. A client uses it to obtain access to a particular file system, or a subset thereof. The server will provide a "handle" which the client can use to identify the file system in subsequent NFS operations. Typically, the client will use the handle to arrange for the remote file system to appear to the user as part of the local file system.

Remote Procedure Call

The remote procedure call specification provides a procedure-oriented interface to remote services. Each server supplies a program that is a set of procedures. NFS is one such "program". The combination of host address, program number and procedure number specifies one remote service procedure. RPC does not depend on services provided by specific protocols, so it can be used with any underlying transport protocol. The RPC protocol is described in Remote Procedure Calls: Protocol Specification.

External Data Representation

The External Data Representation (XDR) standard provides a common way of representing a set of data types over a network. The NFS Protocol Specification is written using the RPC data description language. For more information, see XDR Protocol Specification. Implementations of XDR and RPC are available in the public domain, but XNFS does not require their use. Any software that provides equivalent functionality can be used, and if the encoding is exactly the same it can interoperate with other implementations of XNFS.

Stateless Servers and Idempotency

The NFS protocol is stateless, in that a server need not maintain any state about the clients which it serves. It may in fact store state to improve performance, but this state is not necessary for correct operation. This means that the protocol does not include any mechanisms for managing server or client failure and restart. However, NFS deals with objects such as files and directories which inherently have state. This apparent contradiction is resolved by introducing distributed state and by making operations idempotent.

Distributed state arises when an NFS server passes information such as a file handle or directory search cookie to a client. The server promises, in effect, that when the client passes this information back to the server at a later date, it will usually still be valid and can be used to reconstruct the state needed to perform the requested operation. If the server detects that the state is invalid, it responds with an indication of the problem. In some cases the client may pass the response to the calling application. In other cases the client may take some corrective action and retry the operation.

With a few exceptions, rebooting the server must not invalidate distributed state information. One exception is that the state associated with unstable writes (see NFSPROC3_WRITE) may be invalidated when the server reboots. Another exception is that the state associated with temporary file systems, that is, those that are recreated from scratch by the reboot, may be invalidated. This implies that distributed state will usually refer to objects held on stable server storage, though servers may employ caching techniques to accelerate the interpretation of this state in the normal case when no reboot has occurred.

An idempotent operation is one which can be repeated several times without changing the results. For example, a request to write 5 bytes at offset 165 in a file is idempotent; a request to write 5 bytes at the current end-of-file is not. NFS employs idempotent operations wherever possible. Certain operations are inherently not idempotent, for example, deleting a file, so NFS server implementations will normally include mechanisms to attempt to detect duplicate requests and furnish the appropriate results. Occasionally this strategy will fail and a client will receive an unexpected error; NFS clients and their applications must be tolerant of such occurrences.
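The distinction between the two write requests can be illustrated with a small sketch. It models a file as an in-memory byte array; the function names write_at and append are hypothetical and are not part of the protocol. A retransmitted request is simulated by calling the same function twice.

```python
def write_at(data: bytearray, offset: int, payload: bytes) -> None:
    """Write payload at a fixed offset, extending the file with zero bytes if needed."""
    end = offset + len(payload)
    if len(data) < end:
        data.extend(b"\x00" * (end - len(data)))
    data[offset:end] = payload


def append(data: bytearray, payload: bytes) -> None:
    """Write payload at the current end-of-file, wherever that happens to be."""
    data.extend(payload)


# Idempotent: repeating the request leaves the file exactly as it was.
file_a = bytearray(b"." * 170)
write_at(file_a, 165, b"HELLO")
after_first = bytes(file_a)
write_at(file_a, 165, b"HELLO")  # simulated retransmission
same_after_retry = bytes(file_a) == after_first

# Not idempotent: the repeated request writes the data a second time.
file_b = bytearray(b"." * 170)
append(file_b, b"HELLO")
append(file_b, b"HELLO")  # simulated retransmission
length_after_retry = len(file_b)
```

In this sketch the first file is unchanged by the retry, while the second grows by five bytes for each repetition, which is why a server would need duplicate detection for the second kind of request.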

XNFS Protocol Definition

Servers can change over time, and so can the protocol that they use. RPC therefore provides a version number with each RPC request. This chapter describes version 2 of the NFS protocol. It contains procedures and parameters which are unused (obsolete) but which are retained for compatibility purposes. NFS server implementations should be prepared to handle these appropriately.

File System Model

NFS assumes a file system that is hierarchical, with directories as all but the bottom-level files. Each entry in a directory (file, directory, device, and so on) has a string name. Different operating systems may have restrictions on the depth of the tree or the names used, as well as using different syntax to represent the "pathname", which is the concatenation of all the "components" (directory and filenames) in the name. A "file system" is a tree on a single server (usually a single disk or physical partition) with a specified "root". Some operating systems provide a "mount" operation to make all file systems appear as a single tree, while others maintain a "forest" of file systems. Ordinary files are unstructured streams of uninterpreted bytes.

NFS looks up one component of a pathname at a time. It may not be obvious why it does not just take the whole pathname, travel down the directories, and return a file handle when it is done. There are several good reasons not to do this. First, pathnames need separators between the directory components, and different operating systems use different separators. A Network Standard Pathname Representation could be defined, but then every pathname would have to be parsed and converted at each end. Other issues are discussed in XNFS Implementation Issues.
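A minimal sketch of component-at-a-time resolution follows. It assumes a hypothetical lookup callable standing in for an NFSPROC_LOOKUP call, and a toy in-memory table in place of a server; only the client knows the local separator, so the same server data serves clients with different pathname syntax.

```python
from typing import Callable, Dict, Tuple

# Hypothetical stand-in for NFSPROC_LOOKUP: (directory handle, name) -> file handle.
Lookup = Callable[[bytes, str], bytes]


def resolve(root: bytes, local_path: str, separator: str, lookup: Lookup) -> bytes:
    """Walk a local pathname one component at a time, starting from root."""
    handle = root
    for component in local_path.split(separator):
        if component:  # skip empty components from leading or doubled separators
            handle = lookup(handle, component)
    return handle


# Toy server-side table, for illustration only.
TABLE: Dict[Tuple[bytes, str], bytes] = {
    (b"root", "usr"): b"fh-usr",
    (b"fh-usr", "lib"): b"fh-lib",
}


def toy_lookup(directory: bytes, name: str) -> bytes:
    return TABLE[(directory, name)]


slash_client = resolve(b"root", "/usr/lib", "/", toy_lookup)
backslash_client = resolve(b"root", "usr\\lib", "\\", toy_lookup)
```

Both clients arrive at the same handle without the server ever parsing a separator.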

An exception to the single component lookup policy can be made in the case of a multi-component lookup relative to a public filehandle (see WebNFS Extensions). In this case the pathname is required to be slash (/) separated and evaluated by the server. The server must evaluate any symbolic links that occur in intermediate components of the path, but not a link that occurs as the final component.

Although files and directories are similar objects in many ways, different procedures are used to read directories and files. This enforces a common network representation of directory contents and places the XDR encoding of this information directly in the NFS protocol, rather than overloading the interpretation of file access operations. It also enforces an access model in which it is impossible to retrieve partial directory information or to start a directory search at an invalid point. The same argument as above could have been used to justify a procedure that returns only one directory entry per call. However, directories can contain many entries, and a remote call to return each would lead to unacceptable performance.

Symbolic Links

The NFS file system model includes the concept of symbolic links, in which a directory entry is associated with a piece of text instead of a file or directory. An NFS client which encounters a symbolic link while processing a path will normally issue an NFSPROC_READLINK to retrieve the text, and will then treat this as a path and look up the components to locate the actual file or directory. An NFS server need not implement symbolic links; if it does not, it must be prepared to return a PROC_UNAVAIL error if a client invokes NFSPROC_READLINK or NFSPROC_SYMLINK. Similarly, an NFS client should only issue an NFSPROC_READLINK if an NFSPROC_LOOKUP returns an entry typed as an NFLNK, and should be prepared to handle failures of any symbolic link operations.

RPC Information

Authentication

The NFS service uses AUTH_UNIX style authentication, except in the NULL procedure where AUTH_NONE is also permitted.

Transport Protocols

Current implementations of NFS are supported over UDP/IP only.

Port Number

The NFS protocol uses UDP port number 2049 (decimal). Since this is not an officially assigned port, it is possible that it may change in the future. For maximum interoperability it is recommended (but not required) that NFS servers use UDP port 2049 if possible, and that NFS clients use the portmap mechanism to locate the NFS program on a server.

WebNFS servers must use UDP and TCP port 2049.

Sizes of XDR Structures

These are the sizes, given in decimal bytes, of various XDR structures used in the protocol:

/*
 * The maximum number of bytes of data in a READ or
 * WRITE request.
 */
const NFS_MAXDATA = 8192;

/* The maximum number of bytes in a pathname argument. */
const NFS_MAXPATHLEN = 1024;

/* The maximum number of bytes in a filename argument. */
const NFS_MAXNAMLEN = 255;

/*
 * The size in bytes of the opaque "cookie" passed by
 * READDIR.
 */
const NFS_COOKIESIZE = 4;

/* The size in bytes of the opaque file handle. */
const NFS_FHSIZE = 32;

Basic Data Types

The following XDR definitions are basic structures and types used in other structures described later.

stat
enum stat {
    NFS_OK = 0,
    NFSERR_PERM = 1,
    NFSERR_NOENT = 2,
    NFSERR_IO = 5,
    NFSERR_NXIO = 6,
    NFSERR_ACCES = 13,
    NFSERR_EXIST = 17,
    NFSERR_NODEV = 19,
    NFSERR_NOTDIR = 20,
    NFSERR_ISDIR = 21,
    NFSERR_FBIG = 27,
    NFSERR_NOSPC = 28,
    NFSERR_ROFS = 30,
    NFSERR_NAMETOOLONG = 63,
    NFSERR_NOTEMPTY = 66,
    NFSERR_DQUOT = 69,
    NFSERR_STALE = 70
};

The stat type is returned with every procedure's results. A value of NFS_OK indicates that the call completed successfully and the results are valid. The other values indicate some kind of error occurred on the server side during the servicing of the procedure.

NFSERR_PERM
Not owner. The caller does not have the correct ownership to perform the requested operation.

NFSERR_NOENT
No such file or directory. The file or directory specified does not exist.

NFSERR_IO
Some sort of hard error occurred when the operation was in progress. This could be a disk error, for example.

NFSERR_NXIO
No such device or address.

NFSERR_ACCES
Permission denied. The caller does not have the correct permission to perform the requested operation.

NFSERR_EXIST
File exists. The file specified already exists.

NFSERR_NODEV
No such device.

NFSERR_NOTDIR
Not a directory. The caller specified a non-directory in a directory operation.

NFSERR_ISDIR
Is a directory. The caller specified a directory in a non-directory operation.

NFSERR_FBIG
File too large. The operation caused a file to grow beyond the server's limit.

NFSERR_NOSPC
No space left on device. The operation caused the server's file system to reach its limit.

NFSERR_ROFS
Read-only file system. Write attempted on a read-only file system.

NFSERR_NAMETOOLONG
File name too long. The filename in an operation was too long.

NFSERR_NOTEMPTY
Directory not empty. Attempted to remove a directory that was not empty.

NFSERR_DQUOT
Disk quota exceeded. The client's disk quota on the server has been exceeded.

NFSERR_STALE
The fhandle given in the arguments was invalid. That is, the file referred to by that file handle no longer exists, or access to it has been revoked.

ftype
enum ftype {
    NFNON = 0,
    NFREG = 1,
    NFDIR = 2,
    NFBLK = 3,
    NFCHR = 4,
    NFLNK = 5
};

The enumeration ftype gives the type of a file. The type NFNON indicates a non-file, NFREG is a regular file, NFDIR is a directory, NFBLK is a block-special device, NFCHR is a character-special device, and NFLNK is a symbolic link.

nfscookie
typedef opaque nfscookie[NFS_COOKIESIZE];

The nfscookie is an opaque value that identifies a particular piece of data, such as a directory entry in the NFSPROC_READDIR call.

fhandle
typedef opaque fhandle[NFS_FHSIZE];

The fhandle is the file handle passed between the server and the client. All file operations are done using file handles to refer to a file or directory. The file handle can contain whatever information the server needs to distinguish an individual file.

A filehandle that consists of 32 zero bytes is called the public filehandle. It is used by WebNFS clients to identify an associated public directory on the server. See WebNFS Extensions for further information.

timeval
struct timeval {
    unsigned int seconds;
    unsigned int useconds;
};

The timeval structure is the number of seconds and microseconds since midnight January 1, 1970, Greenwich Mean Time. It is used to pass time and date information.
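As an illustration, a timeval can be serialised as two XDR unsigned integers, each four bytes in big-endian order. The helper names encode_timeval and decode_timeval below are hypothetical; this is a sketch of the wire layout, not a reference implementation.

```python
import struct
from typing import Tuple


def encode_timeval(seconds: int, useconds: int) -> bytes:
    """Pack a timeval as two 4-byte big-endian unsigned integers."""
    if not 0 <= useconds < 1_000_000:
        raise ValueError("useconds must be in the range 0..999999")
    return struct.pack(">II", seconds, useconds)


def decode_timeval(raw: bytes) -> Tuple[int, int]:
    """Unpack the 8-byte wire form back into (seconds, useconds)."""
    seconds, useconds = struct.unpack(">II", raw)
    return seconds, useconds


# One and a half seconds after midnight, January 1, 1970.
wire = encode_timeval(1, 500_000)
```

Because seconds is an unsigned 32-bit quantity, the representable range ends in the year 2106.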

diropok
struct diropok {
    fhandle file;
    fattr   attributes;
};

The diropok structure is used by the server to return the file handle and attributes of a file after a successful NFSPROC_LOOKUP, NFSPROC_CREATE or NFSPROC_MKDIR operation.

fattr
struct fattr {
    ftype        type;
    unsigned int mode;
    unsigned int nlink;
    unsigned int uid;
    unsigned int gid;
    unsigned int size;
    unsigned int blocksize;
    unsigned int rdev;
    unsigned int blocks;
    unsigned int fsid;
    unsigned int fileid;
    timeval      atime;
    timeval      mtime;
    timeval      ctime;
};

The fattr structure contains the attributes of a file; type is the type of the file; nlink is the number of hard links to the file (the number of different names for the same file); uid is the user identification number of the owner of the file; gid is the group identification number of the group of the file; size is the size in bytes of the file; blocksize is the preferred block size in bytes for the file; rdev is the device number of the file if it is type NFCHR or NFBLK; blocks is the number of 512-byte blocks the file takes up on the server; fsid is the file system identifier for the file system containing the file; fileid is a number that uniquely identifies the file within its file system; atime is the time when the file was last accessed for either read or write; mtime is the time when the file data was last modified (written), and ctime is the time when the status of the file was last changed. Writing to the file also changes ctime if the size of the file changes.

mode is the access mode encoded as a set of bits. Notice that the file type is specified both in the mode bits and in the file type; the server must ensure they are consistent.

The descriptions given below specify the bit positions using octal numbers.


Bit        Description
0040000    This is a directory; type field must be NFDIR.
0020000    This is a character special file; type field must be NFCHR.
0060000    This is a block special file; type field must be NFBLK.
0100000    This is a regular file; type field must be NFREG.
0120000    This is a symbolic link file; type field must be NFLNK.
0140000    This is a named socket; type field must be NFNON.
0004000    Set user ID on execution.
0002000    Set group ID on execution.
0001000    Not used.
0000400    Read permission for owner.
0000200    Write permission for owner.
0000100    Execute and search permission for owner.
0000040    Read permission for group.
0000020    Write permission for group.
0000010    Execute and search permission for group.
0000004    Read permission for others.
0000002    Write permission for others.
0000001    Execute and search permission for others.



Notes:

  1. The bits correspond to the mode bits returned by the stat() XSI system call, with the addition of the socket and symbolic link combinations which are supported by NFS and some operating systems.

  2. The rdev field in the attributes structure is an operating system-specific device specifier.
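The file type values in the table overlap (for example, 0060000 is the combination of 0040000 and 0020000), so they have to be compared as a group rather than tested bit by bit. The sketch below shows one way a receiver might check that mode and type agree. The mask value 0170000 is an assumption taken from common UNIX practice and is not stated in this chapter; the function names are hypothetical.

```python
from typing import Optional, Tuple

TYPE_MASK = 0o170000  # assumption: conventional UNIX file-type mask, not given in the text

# File type values from the table, mapped to the ftype name they require.
TYPE_BITS = {
    0o040000: "NFDIR",
    0o020000: "NFCHR",
    0o060000: "NFBLK",
    0o100000: "NFREG",
    0o120000: "NFLNK",
    0o140000: "NFNON",  # named socket
}


def describe_mode(mode: int) -> Tuple[Optional[str], str]:
    """Return the ftype name implied by mode and an rwx string for owner, group, others."""
    type_name = TYPE_BITS.get(mode & TYPE_MASK)
    permissions = ""
    for shift in (6, 3, 0):
        bits = (mode >> shift) & 0o7
        permissions += "r" if bits & 0o4 else "-"
        permissions += "w" if bits & 0o2 else "-"
        permissions += "x" if bits & 0o1 else "-"
    return type_name, permissions


def is_consistent(type_name: str, mode: int) -> bool:
    """Check that the type field and the type bits in mode describe the same kind of file."""
    return TYPE_BITS.get(mode & TYPE_MASK) == type_name
```

For instance, describe_mode(0o100644) yields NFREG with read and write permission for the owner and read permission for group and others.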

sattr
struct sattr {
    unsigned int mode;
    unsigned int uid;
    unsigned int gid;
    unsigned int size;
    timeval      atime;
    timeval      mtime;
};

The sattr structure contains the file attributes which can be set from the client. The fields are the same as for fattr above. A value of 0xffffffff indicates a field that must be ignored. A size of zero means the file must be truncated to zero length.
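A sketch of how a client might build an sattr that changes only some fields follows. The helper name encode_sattr is hypothetical, and the treatment of an ignored time as both seconds and useconds set to 0xffffffff is an assumption; the text above only says that the value 0xffffffff marks a field to be ignored.

```python
import struct
from typing import Optional, Tuple

IGNORE = 0xFFFFFFFF  # value marking a field the server must ignore


def encode_sattr(mode: Optional[int] = None,
                 uid: Optional[int] = None,
                 gid: Optional[int] = None,
                 size: Optional[int] = None,
                 atime: Optional[Tuple[int, int]] = None,
                 mtime: Optional[Tuple[int, int]] = None) -> bytes:
    """Pack an sattr; any argument left as None is sent as 'ignore'."""
    def word(value: Optional[int]) -> int:
        return IGNORE if value is None else value

    def stamp(value: Optional[Tuple[int, int]]) -> Tuple[int, int]:
        # Assumption: both halves of an ignored timeval carry the ignore value.
        return (IGNORE, IGNORE) if value is None else value

    a_sec, a_usec = stamp(atime)
    m_sec, m_usec = stamp(mtime)
    return struct.pack(">8I", word(mode), word(uid), word(gid), word(size),
                       a_sec, a_usec, m_sec, m_usec)


# Truncate a file to zero length and leave every other attribute alone.
truncate_only = encode_sattr(size=0)
```

Using None for "not set" on the client side keeps a legitimate size of zero distinct from an ignored size.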

filename
typedef string filename<NFS_MAXNAMLEN>;

The type filename is used for passing filenames or pathname components. A string length of zero is invalid.

Implementations and applications must be able to handle file names as 8-bit transparent data (allowing use of arbitrary character set encodings). For maximum portability and interworking, it is recommended that applications and users define file names containing only the characters of the Portable Filename Character Set defined in ISO/IEC 9945-1:1990.
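On the wire, an XDR string is a 4-byte big-endian length followed by the bytes themselves, padded with zero bytes to a multiple of four. The sketch below applies the filename constraints to that encoding; the function name encode_filename is hypothetical. It takes bytes rather than text so that names stay 8-bit transparent.

```python
import struct

NFS_MAXNAMLEN = 255


def encode_filename(name: bytes) -> bytes:
    """Encode a filename as an XDR string, rejecting empty and over-long names."""
    if len(name) == 0:
        raise ValueError("a filename of length zero is invalid")
    if len(name) > NFS_MAXNAMLEN:
        raise ValueError("filename exceeds NFS_MAXNAMLEN")
    padding = (4 - len(name) % 4) % 4
    return struct.pack(">I", len(name)) + name + b"\x00" * padding


encoded = encode_filename(b"hello")
```

The length word counts only the name itself, not the padding.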

path
typedef string path<NFS_MAXPATHLEN>;

The type path is a pathname to be used in the symbolic link operations NFSPROC_SYMLINK and NFSPROC_READLINK. The server must consider it as a string with no internal structure. A string length of zero is invalid.

For maximum portability and interworking, it is recommended that applications and users define path names containing only the slash character (if required) plus the characters of the Portable Filename Character Set defined in ISO/IEC 9945-1:1990.

attrstat
union attrstat switch (stat status) {
    case NFS_OK:
        fattr attributes;
    default:
        void;
};

The attrstat structure is a common procedure result. It contains a status and, if the call succeeded, it also contains the attributes of the file on which the operation was performed.

diropargs
struct diropargs {
    fhandle  dir;
    filename name;
};

The diropargs structure is used in directory operations. The fhandle dir is the directory in which to find the file name. A directory operation is one in which the directory is affected.

diropres
union diropres switch (stat status) {
    case NFS_OK:
        struct diropok diropok;
    default:
        void;
};

The results of a directory operation are returned in a diropres structure. If the call succeeded, a new file handle file and the attributes associated with that file are returned along with the status.
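The following sketch decodes a diropres from raw bytes, under the assumption that the reply body has already been separated from the RPC header. The function name decode_diropres and the dictionary layout are hypothetical. The status discriminant comes first; the file handle and the 17 words of fattr are present only when it is NFS_OK.

```python
import struct

NFS_OK = 0
NFS_FHSIZE = 32

# The eleven single-word fattr fields, in wire order; three timevals follow them.
FATTR_FIELDS = ("type", "mode", "nlink", "uid", "gid", "size",
                "blocksize", "rdev", "blocks", "fsid", "fileid")


def decode_diropres(raw: bytes):
    """Return (status, file handle, attributes); the last two are None on error."""
    (status,) = struct.unpack_from(">I", raw, 0)
    if status != NFS_OK:
        return status, None, None  # default arm of the union: void
    handle = raw[4:4 + NFS_FHSIZE]  # fixed-length opaque, so no length word
    words = struct.unpack_from(">17I", raw, 4 + NFS_FHSIZE)
    attributes = dict(zip(FATTR_FIELDS, words))
    attributes["atime"] = words[11:13]
    attributes["mtime"] = words[13:15]
    attributes["ctime"] = words[15:17]
    return status, handle, attributes
```

An attrstat could be decoded the same way by omitting the file handle step.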

XNFS Implementation Issues

The NFS protocol is designed to be operating system-independent, but since this version was designed in a UNIX environment, many operations have semantics similar to the operations of the UNIX file system. This section discusses some of the implementation-specific semantic issues.

Server/Client Relationship

Every NFS client can also potentially be a server, and remote and local mounted file systems can be freely intermixed. This leads to some interesting problems when a client travels down the directory tree of a remote file system and reaches the mount point on the server for another remote file system. Allowing the server to follow the second remote mount would require loop detection, server lookup and user revalidation. Instead, it was decided not to let clients cross a server's mount point.

When a client does an NFSPROC_LOOKUP on a directory on which the server has mounted a file system, the client sees the underlying directory instead of the mounted directory. A client can do remote mounts that match the server's mount points to maintain the server's view.

Permission Issues

The NFS protocol, strictly speaking, does not define the permission checking used by servers. However, it is expected that a server will do normal operating system permission checking using AUTH_UNIX style authentication as the basis of its protection mechanism. The server gets the client's effective UID, effective GID and groups on each call, and uses them to check permission. There are various problems with this method that can be resolved in interesting ways.

Using UID and GID implies that the client and server share the same UID list. Every server and client pair must have the same mapping from user to UID and from group to GID. Since every client can also be a server, this tends to imply that the whole network shares the same UID/GID space.

Another problem arises due to the usually stateful open operation. Most operating systems check permission at open time, and then check that the file is open on each read and write request. With stateless servers, the server has no idea that the file is open and must do permission checking on each read and write call. On a local file system, a user can open a file and then change the permissions so that no one is allowed to touch it, but will still be able to write to the file because it is open. On a remote file system, by contrast, the write would fail. To get around this problem, the server's permission checking algorithm should allow the owner of a file to access it regardless of the permission setting.

A similar problem has to do with paging in from a file over the network. The operating system usually checks for execute permission before opening a file for demand paging, and then reads blocks from the open file. The file may not have read permission, but after it is opened it doesn't matter. An NFS server cannot tell the difference between a normal file read and a demand page-in read. To make this work, the server allows reading of files if the UID given in the call has execute or read permission on the file.
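The two accommodations described above (the owner is always allowed, and execute permission is treated as sufficient for reading) might be combined in a server's read check roughly as follows. The function name may_read is hypothetical, and the choice between group and other permission bits follows ordinary UNIX practice, which is an assumption here.

```python
from typing import Sequence

READ = 0o4
EXECUTE = 0o1


def may_read(mode: int, file_uid: int, file_gid: int,
             caller_uid: int, caller_gids: Sequence[int]) -> bool:
    """Decide whether a read request should be allowed on a stateless server."""
    if caller_uid == file_uid:
        # The owner may have changed the permissions after opening the file.
        return True
    if file_gid in caller_gids:
        bits = (mode >> 3) & 0o7  # group permission bits
    else:
        bits = mode & 0o7         # permission bits for others
    # A demand page-in looks like a normal read, so execute permission also suffices.
    return bool(bits & (READ | EXECUTE))
```

A server that applied only the read bit would break execution of binaries that are installed execute-only.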

In most operating systems, a particular user has access to all files no matter what permission and ownership they have. An NFS client request on behalf of such a user will be made with the user ID of zero. This "super-user" permission might not be allowed on the server, since anyone who can gain that privilege on their client system could gain access to all remote files. An XNFS server, by default, maps user ID 0 to -2 (0xfffffffe) before doing its access checking. A server implementation may provide a mechanism to change this mapping.
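A small sketch of the default mapping follows. The names effective_uid and map_root are hypothetical; the map_root flag stands for whatever mechanism an implementation provides to change the mapping.

```python
ROOT_UID = 0
MAPPED_UID = -2 & 0xFFFFFFFF  # -2 as an unsigned 32-bit value, 0xfffffffe


def effective_uid(caller_uid: int, map_root: bool = True) -> int:
    """Return the UID the server uses for access checking."""
    if map_root and caller_uid == ROOT_UID:
        return MAPPED_UID
    return caller_uid
```

All other user IDs pass through unchanged, so only super-user requests lose their privilege.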

Server Procedures

The protocol definition is given as a set of procedures with arguments and results defined using the RPC language. A brief description of the function of each procedure should provide enough information to allow implementation.

All of the procedures in the NFS protocol are synchronous. When a procedure returns to the client, the operation has completed and any data associated with the request is now on stable storage. For example, a client NFSPROC_WRITE request will cause the server to update some or all of the following: data blocks, file system information blocks (such as indirect blocks), and file attribute information (size and modify times). When the NFSPROC_WRITE returns to the client, it can assume that the write is safe, even in case of a server crash, and it can discard the data written. This is a very important part of the statelessness of the server. If the server waited to flush data from remote requests, the client would have to save those requests so that it could resend them in case of a server crash.

/*
 * Remote file service routines
 */
program NFS_PROGRAM {
    version NFS_VERSION {
        void        NFSPROC_NULL(void)              = 0;
        attrstat    NFSPROC_GETATTR(fhandle)        = 1;
        attrstat    NFSPROC_SETATTR(sattrargs)      = 2;
        void        NFSPROC_ROOT(void)              = 3;
        diropres    NFSPROC_LOOKUP(diropargs)       = 4;
        readlinkres NFSPROC_READLINK(fhandle)       = 5;
        readres     NFSPROC_READ(readargs)          = 6;
        void        NFSPROC_WRITECACHE(void)        = 7;
        attrstat    NFSPROC_WRITE(writeargs)        = 8;
        diropres    NFSPROC_CREATE(createargs)      = 9;
        stat        NFSPROC_REMOVE(diropargs)       = 10;
        stat        NFSPROC_RENAME(renameargs)      = 11;
        stat        NFSPROC_LINK(linkargs)          = 12;
        stat        NFSPROC_SYMLINK(symlinkargs)    = 13;
        diropres    NFSPROC_MKDIR(createargs)       = 14;
        stat        NFSPROC_RMDIR(diropargs)        = 15;
        readdirres  NFSPROC_READDIR(readdirargs)    = 16;
        statfsres   NFSPROC_STATFS(fhandle)         = 17;
    } = 2;
} = 100003;
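To show how the program, version and procedure numbers are used, the sketch below builds the body of an RPC call to NFSPROC_NULL with AUTH_NONE credentials. The field order and the values for the message type, RPC version and AUTH_NONE flavor come from the RPC protocol rather than from this chapter, and the function name null_call is hypothetical. No record marking or transport handling is included.

```python
import struct

NFS_PROGRAM = 100003
NFS_VERSION = 2
NFSPROC_NULL = 0

# Values defined by the RPC protocol, not by this chapter.
RPC_CALL = 0
RPC_VERSION = 2
AUTH_NONE = 0


def null_call(xid: int) -> bytes:
    """Build an RPC call message for NFSPROC_NULL; the procedure takes no arguments."""
    return struct.pack(
        ">10I",
        xid,           # transaction identifier chosen by the client
        RPC_CALL,      # message type
        RPC_VERSION,   # RPC protocol version
        NFS_PROGRAM,
        NFS_VERSION,
        NFSPROC_NULL,
        AUTH_NONE, 0,  # credential: flavor and zero-length body
        AUTH_NONE, 0,  # verifier: flavor and zero-length body
    )


message = null_call(xid=1)
```

Calls to the other procedures would append the XDR-encoded arguments after the verifier and, except for the NULL procedure, carry AUTH_UNIX credentials.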
The following reference pages define each of the server procedures.

