Directories, Databases, and Indexers
WHICH? WHEN? WHY?
(Originally published in Messaging Magazine, November/December 1998)
Jitze Couperus, Control Data Systems, Inc.
Introduction
At some point, the question arises of when to use a directory, a database, or an indexer for a specific application. This article gives an overview of the elements that distinguish these approaches and their usage characteristics. It first discusses the primary functions that a directory provides in its role as a locator of objects, then contrasts those functions with the environments at which databases and indexers are aimed.
The Primary Function of a Database as Contrasted to a Directory
The use of databases grew from the desire to have a single and non-redundant collection of information that could be shared between multiple applications. As such, it tends to contain all of the necessary facts (about some aspect of the enterprise) that are required to support the various functions of all of the applications that work with this information.
A database will therefore contain qualitative and identifying types of information (part numbers, employee numbers) as well as more quantitative and dynamic information (quantities available, salaries). Of particular note, the applications use the contents of the database as a machine-resident reflection of reality: the database is said to "model" some business area, and transactions continuously update it as reality changes.
Thus, the key aspects of a database can be summarized as completeness (containing all the necessary elements required to model some piece of reality within the enterprise) and flux (continuous updating to ensure fidelity of the model in matching the reality).
While the differences in directory technology between X.500, LDAP-based, or proprietary solutions may determine the best choice for a specific company situation, those differences are not significant in contrasting with a database. The function of a directory is similar to a database only in that it is asked to model some aspect of reality, but not to the extent that it is the repository for all of the facts sufficient to support all applications. Its primary function instead is to act as a map to help applications find the underlying information, which may or may not reside in a database or some other mechanism. As such, a directory usually will contain only those facts that are necessary to support its function of finding a desired object and then showing the way there.
The lines between these two capabilities are not always crisp; indeed, each one can be (and sometimes has been) forced to fill the role of the other. However, because their primary purposes differ, so do their interfaces to their user communities, their choices of features, and their optimizations for the task at hand.
In particular, a directory is designed to support its role as a locator of diverse kinds of objects that are spread across a distributed network.
The Use of Directory as a "Locator"
With the advent of general "inter-connectedness" of corporate IT resources via e-mail, the web, and related mechanisms, the issue arises as to how to "locate" various kinds of objects within the network space. Note the use of the word "locate" here: it connotes elements of name management, navigation, and querying. All three functions are required as a minimum to support the concept of "locating".
(a) Name Management is inextricably tied in with the hierarchical concept of addresses. As a simple metaphor, consider a conventional postal address that is used from the broadest level (Country) through successive levels of refinement (State or Zip Code, City, Street, House Number) to the narrowest level of refinement (the name of the individual addressee). Note that none of these elements is sufficient to identify an object (person in this case) uniquely; it is the total combination of the elements that finally pins down a specific object. Name Management is the term used here to cover all of the aspects of organizing and managing the names and relationships that go into such a hierarchy.
The aspect of "management" is important to understand here: a directory is not merely a passive information storage and retrieval engine designed to respond to the needs of general-purpose applications. It can contain a large component of active policy enforcement (referred to as "business rules" in the context of database management systems, or DBMS), which brings the directory closer to a purpose-built application than to a database.
(b) Querying is in essence what we do when we know a priori what we are looking for and ask for the address (how to get there). For example, give me the address for "Joe Brown".
(c) Navigation is what we do when, in looking for something, we do not know enough to formulate the query. Instead, we must locate our target by examining the choices offered to us at any point in the hierarchy. By choosing one, we can proceed to the next level, and so on, in an iterative fashion until we finally pinpoint the specific object being sought. For example, I may know I wish to contact the sales manager close to New York. I have no idea of any name involved, or even whether there is a New York office. So I start at the top of the company's organization chart and work my way down through successive choices at each level until I find the name and address of the person who appears to be the best fit.
It is in the context of these three elements that we can think about the implications of using a directory as a general locator of objects within a network space.
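The three functions can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Entry` class, its method names, the "/" separator, and the sample company are all hypothetical and do not correspond to X.500, LDAP, or any particular product.

```python
class Entry:
    """One node in a hypothetical directory tree."""

    def __init__(self, name, parent=None, **attributes):
        self.name = name              # name relative to the parent only
        self.parent = parent
        self.attributes = attributes  # facts needed to locate or contact the object
        self.children = {}            # relative name -> Entry

    def add(self, name, **attributes):
        """Name management: enforce that a name is unique among its siblings."""
        if name in self.children:
            raise ValueError(f"{name!r} already exists under {self.full_name()!r}")
        child = Entry(name, parent=self, **attributes)
        self.children[name] = child
        return child

    def full_name(self):
        """Combine every level, broadest first, into one identifying name."""
        parts = []
        node = self
        while node is not None:
            parts.append(node.name)
            node = node.parent
        return "/".join(reversed(parts))

    def query(self, name):
        """Querying: the name is known in advance; return every match below."""
        hits = [self] if self.name == name else []
        for child in self.children.values():
            hits.extend(child.query(name))
        return hits

    def choices(self):
        """Navigation: list the options offered at this level."""
        return sorted(self.children)


company = Entry("Acme")
office = company.add("Sales").add("East Region").add("New York Office")
office.add("Joe Brown", title="Sales Manager")

# Querying: we already know whom we are looking for.
found = company.query("Joe Brown")

# Navigation: we know no names, so we descend one choice at a time.
node = company
for choice in ["Sales", "East Region", "New York Office"]:
    node = node.children[choice]

print(found[0].full_name())
print(node.choices())
```

The uniqueness check in `add` is a small instance of the active policy enforcement described above: the structure itself refuses a name that would be ambiguous, rather than leaving that rule to each application.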
A Directory or a General Purpose Database?
The distinction between directories and databases is one that is unclear to many people, mainly because there is no single rule that differentiates between the two.
First and foremost, of course, is the "active name management" aspect noted above, whereby a directory functions as much more than just a passive storage and retrieval engine. But in addition to that function, there are other significant characteristics in behavior and usage that are useful in drawing contrasts.
Information Versus Meta-Information
It is perhaps easiest to think of a database as a repository of the objects themselves, while a directory contains (some of the) information about objects, specifically the information needed to exploit an object within the context of its role(s) in a connected network. Thus, a complete personnel record for an employee would be held in a personnel database, but information needed for that employee to function successfully in a networked environment would be stored in the network directory. This information may be disjoint, or it may be duplicated and "synchronized" between the database and the directory, either through physical replication or in a virtual sense by indirect reference.
Dynamic Content Versus More Stable Meta-Content
Another characteristic worth noting is that databases tend to be more dynamic in terms of the information they contain that is updated in a transactional mode. As such, database systems are heavily geared to supporting the requirements of a "transaction" environment. Directories, on the other hand, tend to hold information about objects that is more static in nature and is heavily geared toward the functions of "locating" as outlined earlier.
By the same token, databases are characterized by a focus on storage and retrieval of arbitrary but highly structured collections of data. They are represented as a series of two-dimensional tables that conform to strict rules of normalization and are defined and operated on by a language called SQL, a general programming-language style interface.
Directories, by contrast, are focused on storage and retrieval of meta-information in arbitrary hierarchies, where every field in the data itself is not only highly variable in length, but is also only meaningful in the context of the information above it in the hierarchy. (Contrast this with the normalization requirement of unique identifiers per table in the database context.) And they are operated on by a communications-style protocol like Lightweight Directory Access Protocol (LDAP).
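A minimal sketch may make the contrast concrete. It uses Python's standard `sqlite3` module for the database side and a plain dictionary keyed by full path for the directory side. The table, column names, and sample entries are assumptions made for illustration, and the dictionary merely stands in for a real directory service.

```python
import sqlite3

# Database view: a flat table in which each row carries its own unique key.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employee (emp_no INTEGER PRIMARY KEY, name TEXT, salary INTEGER)"
)
conn.executemany(
    "INSERT INTO employee VALUES (?, ?, ?)",
    [(1001, "Joe Brown", 50000), (1002, "Joe Brown", 62000)],
)
rows = conn.execute(
    "SELECT emp_no, salary FROM employee WHERE name = ? ORDER BY emp_no",
    ("Joe Brown",),
).fetchall()

# Directory view: the relative name "Joe Brown" means nothing on its own;
# only the full path from the top of the hierarchy identifies an entry.
directory = {
    ("Acme", "Sales", "Joe Brown"): {"mail": "jbrown@sales.example.com"},
    ("Acme", "Engineering", "Joe Brown"): {"mail": "jbrown@eng.example.com"},
}


def lookup(path):
    """Return the attributes stored at a full path, or None if absent."""
    return directory.get(tuple(path))


print(rows)
print(lookup(["Acme", "Sales", "Joe Brown"]))
```

In the table, the two people named Joe Brown are told apart by `emp_no`. In the hierarchy, they are told apart only by where they sit.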
Circumscribed Versus Federated
Another distinction between databases and directories lies in the implicit borders that circumscribe each system. This is a subtle distinction, but one with important implications on the intrinsic functionality (i.e., built-in business rules) of each system.
Specifically, a database tends to be thought of as a repository of information that models (and thus spans) a defined universe, be it a work group, a department, an enterprise, or some specific application area. The database may be physically distributed or replicated, but, nevertheless, its functional borders are clearly defined, and it is expected to respond to a user's needs only within the context of its own borders.
In a sense, a directory has similar borders, but at the same time, it implicitly or explicitly represents a sub-hierarchy in some larger hierarchical context. The latter case may lie under an even wider umbrella until at some level it falls under the "global" directory that is the sum of all compliant directories within the universe. Whether the keeper of a directory actually chooses to permit such federation (and to what degree) is a choice that is made by the keeper, but the relationship is always there implicitly.
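One way to picture this relationship is the sketch below, in which each server holds one sub-hierarchy and passes requests for names outside its borders up to the enclosing hierarchy. The `DirectoryServer` class and its methods are hypothetical. Real directory standards define their own referral and chaining mechanisms, which this sketch does not attempt to reproduce.

```python
class DirectoryServer:
    """A hypothetical server holding one subtree of a larger hierarchy."""

    def __init__(self, suffix, parent=None):
        self.suffix = tuple(suffix)  # the part of the tree this server owns
        self.parent = parent         # server for the enclosing hierarchy, if any
        self.entries = {}            # full name (tuple) -> attributes
        self.subordinates = []       # servers holding delegated subtrees

    def delegate(self, suffix):
        """Hand a sub-hierarchy to a new subordinate server."""
        child = DirectoryServer(suffix, parent=self)
        self.subordinates.append(child)
        return child

    def holds(self, name):
        """True if the name falls inside this server's borders."""
        return tuple(name[:len(self.suffix)]) == self.suffix

    def resolve(self, name):
        name = tuple(name)
        if not self.holds(name):
            # Outside our own borders: pass the request up, if federated.
            return self.parent.resolve(name) if self.parent else None
        for sub in self.subordinates:
            if sub.holds(name):
                return sub.resolve(name)
        return self.entries.get(name)


world = DirectoryServer(())                 # stands in for the "global" directory
acme = world.delegate(("com", "acme"))
globex = world.delegate(("com", "globex"))
globex.entries[("com", "globex", "Ann Jones")] = {"mail": "ann@globex.example"}

# A request made at Acme's directory for a name in another sub-hierarchy.
print(acme.resolve(("com", "globex", "Ann Jones")))
```

A keeper who declines to federate corresponds to a server created with no `parent`: it still has a place in the wider naming hierarchy, but it answers only for names inside its own borders.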
Major Distinguishing Characteristics
The table below summarizes some of the salient differences between the two kinds of systems.
| General Purpose Database | Hierarchical Directory |
| --- | --- |
| Contains all of the "facts" pertaining to objects that are required by various applications that need to share common information. | Contains primarily meta-information about certain objects referenced in a network environment, sufficient to be able to locate them and retrieve those attributes required by network-oriented applications. |
| Relatively volatile (frequently modified) information and optimized to online transaction processing (OLTP). | Relatively static information and optimized toward rapid retrieval. |
| Well-defined universe of information with distinct boundaries, possibly distributed. | Participates with other directories in a federated manner, probably distributed and replicated (either physically or virtually). |
| Intrinsic intelligence (business rules) limited by the general-purpose nature of a database, primarily to uniqueness and relationship constraints (user code required for application-specific intelligence). | Extensive built-in "business rules" (in some cases defined by the standard) to support the directory function, such as automatic referral, hierarchical inheritance, security mechanisms, etc. |
| Databases will each have their own access point(s), usually custom-built applications. | A federation of directories will have one virtual access point (either through federation or brokered by a meta directory). |
| Structuring ability limited to a collection of two-dimensional tables containing simple data values. Each table is homogeneous within itself and linked to others by comparable domains. | Hierarchical structuring of objects with associated attributes. |
| Schema (database structure definition) is arbitrarily defined by application designer and thus only usable by applications with encoded knowledge of the structure and data semantics. | Schema constrained to conform to certain rules enabling intelligent federation with any other conforming directories. |
A Directory or General Purpose Indexer?
Looking up a name in the "white pages" of a telephone directory can be equated with a simple query. A human uses what is in essence a binary search over an alphabetically ordered list of names to find out some attribute (telephone number) associated with the object being queried (the name of a person with a telephone).
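The white-pages lookup can be sketched with the standard `bisect` module, which performs a binary search over a sorted list. The names and numbers below are made up for illustration.

```python
from bisect import bisect_left

# A hypothetical white-pages listing; it must stay sorted by name.
white_pages = [
    ("Brown, Joe", "555-0101"),
    ("Jones, Ann", "555-0102"),
    ("Smith, Lee", "555-0103"),
]
names = [name for name, _ in white_pages]


def phone_number(name):
    """Binary-search the ordered names; return the number, or None if unlisted."""
    i = bisect_left(names, name)
    if i < len(names) and names[i] == name:
        return white_pages[i][1]
    return None


print(phone_number("Jones, Ann"))
```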
"Yellow pages," on the other hand, can be seen as a one-deep hierarchy containing many separate ordered lists (white pages) at its lower level. Thus, I might search first for "Plumbers" at the top level (again based on alphabetical ordering of the professions) and then within the "plumbers section," I select the one I want.
I have here "navigated" a one-deep hierarchy. However, in a networked world, such hierarchies may be of arbitrary depth, with the "objects" residing at different depths within different parts of the tree. A cursory examination of any enterprise organization chart bears this out as a useful metaphor.
A typical network environment contains many different kinds of objects that need to be "located" (people, documents, network nodes, servers, etc.). Each kind of object has its own (different) associated set of attributes (the meta-information being managed). In addition, there may be multiple hierarchies reflecting different dimensions: one perhaps reflecting the enterprise organization chart, one reflecting the geographic topology of the corporate network, and so forth.
Thus, typical usage of an electronic directory (whether by a human using a software directory agent or by an LDAP-enabled application) will usually reflect some degree of navigation to a point in the tree followed by a query constrained to the set of objects below that portion of the tree.
Metaphorically, I navigate to the required phone book, navigate to the yellow pages, navigate to the plumbers, and then search for the one called "Jones", or I may review the complete list of plumbers and navigate to the one that lives closest to me.
I may choose a plumber on the basis of his name (I used him last time, and I want the same one) or I may choose a plumber on the basis of any number of other "attributes" carried by this "object," such as location, 24-hour availability, or another service attribute. Whatever the reason, the salient point is that I first narrowed down my selection of a plumber by means of navigation. I then chose a specific plumber either by querying on the resulting set or by further navigation in which I examined the desired attributes of each plumber until locating a desired target.
This "navigate followed by search/select" paradigm is a prominent characteristic of directory usage. (The navigational part is sometimes referred to as "setting the search base" or "pruning the search tree.")
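The paradigm can be sketched as two small steps: descend to a search base, then filter only the entries below it. The yellow-pages data, the attribute names, and the helper functions `navigate` and `search` are all assumptions made for illustration.

```python
# A hypothetical one-deep yellow-pages hierarchy: profession -> listings.
yellow_pages = {
    "Electricians": [
        {"name": "Volt", "city": "Albany", "open_24h": False},
    ],
    "Plumbers": [
        {"name": "Jones", "city": "Albany", "open_24h": True},
        {"name": "Smith", "city": "Troy", "open_24h": False},
        {"name": "Lopez", "city": "Troy", "open_24h": True},
    ],
}


def navigate(tree, *path):
    """Set the search base by descending one level per path step."""
    node = tree
    for step in path:
        node = node[step]
    return node


def search(entries, **wanted):
    """Select the entries whose attributes match every requested value."""
    return [e for e in entries if all(e.get(k) == v for k, v in wanted.items())]


plumbers = navigate(yellow_pages, "Plumbers")                 # prune the tree
by_name = search(plumbers, name="Jones")                      # the one I used before
by_attributes = search(plumbers, city="Troy", open_24h=True)  # or choose on attributes

print(by_name)
print(by_attributes)
```

Because the query runs only over the pruned list, the electricians are never examined at all, which is the point of setting the search base first.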
Having established this metaphor, let us examine, for example, how it might apply in the case of a Web-enabled intranet that contains perhaps hundreds of thousands of objects ("pages" in a Web context) that carry useful information for some user community.
When first faced with the requirement to "locate" a page containing desired information from amongst many contained within a Web, a simple first approach is to create an index of every word within all of the pages and point each index entry to the various pages containing that word. Then the user can peruse until the desired target is found.
This is in fact what many of the Web-crawler indexing engines do on the World Wide Web. The problem with this approach is that the user frequently is confronted with thousands of "hits" and must rely on the indexing engine to have arranged them (via some arbitrary scheme) so that the most interesting ones (from whose viewpoint?) are sorted to the top of the list and can be individually examined by the user until an acceptable one is found.
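A word-level index of this kind can be sketched in a few lines. The page paths and text below are invented, and a production indexing engine would add ranking, stemming, and much more. This is only the bare mechanism of pointing each word at the pages that contain it.

```python
import re
from collections import defaultdict

# Hypothetical intranet pages: path -> page text.
pages = {
    "/hr/benefits": "Health benefits for sales and engineering staff",
    "/sales/contacts": "Sales office contacts for New York",
    "/eng/build": "Engineering build server status",
}


def build_index(pages):
    """Map every word to the set of pages in which it appears."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(url)
    return index


def hits(index, word):
    """Return every page containing the word, in an arbitrary (sorted) order."""
    return sorted(index.get(word.lower(), set()))


index = build_index(pages)
print(hits(index, "sales"))
```

Note that the index knows nothing about the pages beyond the words they contain: a benefits page and a contacts page both come back for "sales", and nothing in the result helps the user decide between them.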
Other Web-crawlers combine this approach with a hierarchical classification scheme that can be used for navigation or "tree pruning" (more or less as described above), which allows a user to refine the search base by navigating over and descending successive levels prior to issuing a query or making a selection from a pared list. But even in this case, the entries in the pruned list contain only minimal information to allow the user to make a sensible choice, short of retrieving and examining each candidate document in detail.
In the case of Web-crawlers that are indexing arbitrary textual objects (Web pages) there is of course little alternative to this approach. But when the objects in the network space possess some given set of attributes (e.g., people objects might have phone numbers, e-mail addresses, physical location, functional title, etc.), then a purpose-built directory can provide the user with the ability to perform a sensible search or selection in the final phase of the "navigate followed by search/select" paradigm outlined above. Once the user has selected the desired target, the directory can provide either the "pointer" to the object itself (such as the e-mail address or the network URL of the target) or the actual desired attribute if that happens to be one that is held in the directory (such as the user's public key).
From this discussion we can distill three key points:
While this may appear obvious in retrospect, it provides some answer to the question of when one might use a general-purpose Web-indexing engine versus when a directory would be more appropriate.
If the objects for which a directory is being built are amorphous (such as arbitrary Web pages), then there is little alternative but to unleash a general indexing mechanism against them. Index every word encountered, and leave it to the end-user to "prune" any result set returned from a query as best as he or she can. This is a serious problem for which the only mitigating strategy so far has been for the indexer to also store a textual snippet from the base document and provide this to the searcher in the hope that it may provide some handle on choosing the desired target(s) in the final "select" phase.
On the other hand, if the objects for which a directory is being built are associated with a definable set of attributes, then a much more productive approach can be implemented by means of a directory. Through judicious choice of object attributes that allow the user to navigate and then search or select on an intelligent basis, far more rapid and accurate discovery of targets is possible.
With the advent of general connectivity amongst corporate IT resources via e-mail, the Web, and related mechanisms, the issue arises of how to "locate" various kinds of objects within the network space. It is in the context of the elements of name management, navigation, and querying that we can think about the implications of using a directory as a general locator of objects within a network space. When the objects in the network space possess some given set of attributes, a directory's use of the paradigm "navigate followed by search/select" is clearly the best choice for rapid location of network objects.