This paper appears as Chapter 1 in the book Distributed Systems, edited by Sape Mullender and published by Addison-Wesley and ACM Press in 1993, 1994, and 1995. Copyright 1993 by the ACM Press.

A State-of-the-Art Distributed System: Computing with BOB

Michael D. Schroeder
DEC Systems Research Center
130 Lytton Ave
Palo Alto, CA 94301
mds@pa.dec.com

Distributed systems are a popular and powerful computing paradigm. Yet existing examples have serious shortcomings as a base for general-purpose computing. This chapter explores the characteristics, strengths, and weaknesses of distributed systems. It describes a model for a state-of-the-art distributed system that can do a better job of supporting general-purpose computing than existing systems. This model system combines the best features of centralized and networked systems, with improved availability and security. The chapter concludes by outlining the technical approaches that appear most promising for structuring such a system.

This chapter does not specifically discuss application-specific distributed systems, such as automated banking systems, control systems for roaming and tracking cellular telephones, and retail point-of-sale systems, although there are many economically important examples. The issues raised in the chapter do apply to such systems, and the model system described here should be a suitable base for many of them. Some systems that are designed to support a narrow set of uses, however, may need to be structured in application-specific ways that can be simpler and more efficient for those uses. The extent to which special-purpose systems should be built on a general-purpose distributed base needs to be investigated further. This chapter also does not address real-time distributed systems, such as control systems for factories, aircraft, or automobiles, which face distinctive scheduling and resource utilization requirements.

1. Characteristics of Distributed Systems

A distributed system is several computers doing something together. Thus, a distributed system has three primary characteristics.

o Multiple Computers -- A distributed system contains more than one physical computer, each consisting of CPUs, some local memory, possibly some stable storage like disks, and I/O paths to connect it with the environment.

o Interconnections -- Some of the I/O paths will interconnect the computers. If they cannot talk to each other, then it is not going to be a very interesting distributed system.

o Shared State -- The computers cooperate to maintain some shared state. Put another way, if the correct operation of the system is described in terms of some global invariants, then maintaining those invariants requires the correct and coordinated operation of multiple computers.

Building a system out of interconnected computers requires that three major issues be addressed.

o Independent Failure -- Because there are several distinct computers involved, when one breaks others may keep going. Often we want the "system" to keep working after one or more have failed.

o Unreliable Communication -- Because in most cases the interconnections among computers are not confined to a carefully controlled environment, they will not work correctly all the time. Connections may be unavailable. Messages may be lost or garbled. One computer cannot count on being able to communicate clearly with another, even if both are working. (See the sketch after this list.)

o Insecure Communication -- The interconnections among the computers may be exposed to unauthorized eavesdropping and message modification.
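To make the unreliable-communication issue concrete, the following minimal Python sketch (not from the chapter; the lossy channel is simulated) shows the standard coping technique of retrying on timeout, and why a sender still cannot tell whether a timed-out request was actually executed:

    import random

    def unreliable_send(request):
        # Stand-in for a real network: the request or its reply may be lost.
        if random.random() < 0.3:
            raise TimeoutError("no reply within deadline")
        return "ack:" + request

    def call_with_retry(request, attempts=4):
        # Retry on timeout. A timeout is ambiguous: the receiver may have
        # executed the request and only the reply was lost, so requests
        # should be idempotent or deduplicated by a request identifier.
        for _ in range(attempts):
            try:
                return unreliable_send(request)
            except TimeoutError:
                continue
        # Retry alone cannot guarantee delivery; this can still happen.
        raise TimeoutError("gave up after %d attempts" % attempts)

    print(call_with_retry("write x=1"))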
A centralized system that supports multiple processes and provides some form of interprocess communication, e.g. a Unix timesharing system, can exhibit in virtual form the three primary characteristics of a distributed system. A centralized system may even manifest independent failure, e.g. the daemon process for mail transport may crash without stopping interactive user processes on the same system. Thus, design and programming techniques associated with communicating sequential processes in centralized systems form part of the basic techniques in the distributed systems arena. However, centralized systems are usually successful without dealing with independent failure, and they never need to confront unreliable and insecure communication. In a distributed system all three issues must be addressed. The canonical example of a general-purpose distributed system today is a networked system -- a set of workstations/PCs and servers interconnected with a network.

2. Networked vs Centralized Systems

It is easy to understand why networked systems are popular. Such systems allow the sharing of information and resources over a wide geographic and organizational spread. They allow the use of small, cost-effective computers and get the computing cycles close to the data. They can grow in small increments over a large range of sizes. They allow a great deal of autonomy through separate component purchasing decisions, selection of multiple vendors, use of multiple software versions, and adoption of multiple management policies. Finally, they do not necessarily all crash at once. Thus, in the areas of sharing, cost, growth, and autonomy, networked systems are better than traditional centralized systems as exemplified, say, by timesharing.

On the other hand, centralized systems do some things better than today's networked systems. All information and resources in a centralized system are equally accessible. Functions work the same way and objects are named the same way everywhere in a centralized system. And a centralized system is easier to manage. So despite the advantages of networked systems, centralized systems are often easier to use because they are more accessible, coherent, and manageable.

In the areas of security and availability, the comparison between networked systems and centralized systems produces no clear-cut advantage for either.

2.1 Security

In usual practice to date, neither centralized nor networked systems offer real security, but for different reasons. A centralized system has a single security domain under the control of a single authority. The trusted computing base is contained in a single operating system. That operating system executes on a single computer that can have a physically secure environment. With all the security eggs in one basket, so to speak, users understand the level of trust to assign to the system and whom to speak to when problems arise. On the other hand, it is notoriously difficult to eliminate all the security flaws from an operating system or from the operating environment, and with a single security domain one such flaw can be exploited to break the security of the entire system.

Networked systems have multiple security domains and thus exhibit the inverse set of security properties.
The trusted computing base is scattered among many components that operate in environments with varying degrees of physical security, under differing security policies, and possibly under different authorities. The interconnections among the computers are physically insecure. It is hard to know what is being trusted and what can be trusted. But, because the system contains many computers, exploiting a security flaw in the software or environment of one computer does not automatically compromise the entire system.

2.2 Availability

A similar two-sided analysis applies to availability. A centralized system can have a controlled physical and operational environment. Since a high proportion of system failures are the result of operational and environmental factors, careful management of this single environment can produce good availability. But when something does go wrong, the whole system goes down at once, stopping all users from getting work done.

In a networked system the various computers fail independently. However, it is often the case that several computers must be in operation simultaneously before a user can get work done, so the probability of the system failing is greater than the probability of any one component failing. This increased probability of not working, compared to a centralized system, is the result of ignoring independent failure. The consequence is Leslie Lamport's definition of a distributed system: "You know you have one when the crash of a computer you've never heard of stops you from getting any work done."

On the other hand, independent failure in a distributed system can be exploited to increase availability and reliability. When independent failure is properly harnessed by replicating functions on independent components, multiple component failures are required before system availability and reliability suffer. The probability of the system failing thus can be less than the probability in a centralized system. Dealing with independent failure to avoid making availability worse, or even to make it better, is a major task for designers of distributed systems.

A distributed system also must cope with communication failures. Unreliable communication not only contributes to unavailability, it can lead to incorrect functioning. A computer cannot reliably distinguish a down neighbor from a disconnected neighbor, and therefore can never be sure an unresponsive neighbor has actually stopped. Maintaining global state invariants in such circumstances is tricky. Careful design is required to actually achieve correct operation and high availability using replicated components.

2.3 A State-of-the-Art Distributed System

It seems feasible to develop a distributed system that combines the accessibility, coherence, and manageability advantages of centralized systems with the sharing, growth, cost, and autonomy advantages of networked systems. If real security and high availability were added to the mix, then we would have a state-of-the-art computing base for many purposes. Achieving this combination of features in a single system is the central challenge of supporting general-purpose computing well with a distributed system. No existing system fulfills this ideal.

3. The Properties and Services Model

We can describe this best-of-both-worlds (BOB) distributed computing base in terms of a model of its properties and services. This more technical description of the goals provides a structure that will help us to understand the mechanisms needed to achieve them.
The properties and services model defines BOB as:

o a heterogeneous set of hardware, software, and data components;
o whose size and geographic extent can vary over a large range;
o connected by a network;
o providing a uniform set of services (naming, remote invocation, user registration, time, files, etc.);
o with certain global properties (names, access, security, management, and availability).

Because we are talking about a base for general-purpose computing, the model is defined in the terms most useful to the programmers who develop components that are part of the base and who develop the many different applications that are to be built on top of it. But the fundamental coherence provided by the model will show through such components and applications (when they are correctly implemented) to provide a coherent system as viewed by its human users too.

The coherence that makes BOB a system rather than a collection of computers is a result of its uniform services and global properties. The services are available in the same way to every part of the system, and the properties allow every part of the system to be viewed in the same way, regardless of system size. Designers are well aware of the care that must be taken to produce implementations that can support growth to very large sizes (tens or hundreds of thousands of nodes). A similar challenge exists in making such expandable systems light-weight and simple enough to be suitable for very small configurations too.

BOB's coherence does not mean that all the components of the system must be the same. The model applies to a heterogeneous collection of computers running operating systems such as Unix, VMS, MS-DOS, Windows NT, and others. In short, all platforms can operate in this framework, even computers and systems from multiple vendors. The underlying network can be a collection of local area network segments, bridges, routers, gateways, and various types of long distance services, with connectivity provided by various transport protocols.

3.1 Properties

What do BOB's pervasive properties mean in more detail?

o Global names -- the same names work everywhere. Machines, users, files, distribution lists, access control groups, and services have full names that mean the same thing regardless of where in the system the names are used. For instance, Butler Lampson's user name might be something like /com/dec/src/bwl throughout the system. He will operate under that name when using any computer. Global naming underlies the ability to share things.

o Global access -- the same functions are usable everywhere with reasonable performance. If Butler sits down at a machine when visiting in California, he can do everything there that he can do when in his usual office in Massachusetts, with perhaps some performance degradation. For instance, from Palo Alto Butler could command the local printing facilities to print a file stored on his computer in Cambridge. Global access also includes the idea of data coherence. Suppose Butler is in Cambridge on the phone to Mike Schroeder in Palo Alto, and Butler makes a change to a file and writes it. Mike should be able to read the new version as soon as Butler thinks he has saved it. Neither Mike nor Butler should have to take any special action to make this possible.

o Global security -- the same user authentication and access control work everywhere.
For instance, Butler can authenticate himself to any computer in the system; he can arrange for data transfer secure from eavesdropping and modification between any two computers; and, assuming that the access control policy permits it, Butler can use exactly the same mechanism to let the person next door and someone from another site read his files. All the facilities that require controlled access (logins, files, printers, management functions, etc.) use the same machinery to provide access control.

o Global management -- the same person can manage components anywhere. Obviously one person will not manage all of a large system. But the system should not impose a priori constraints on which set of components a single person can manage. All of the components of the system provide a common interface to management tools. The tools allow a manager to perform the same action on large numbers of components at once. For instance, a single system manager can configure all the workstations in an organization without leaving his office.

o Global availability -- the same services work even after some failures. System managers get to decide (and pay for) the level of replication for each service. As long as the failures do not exceed the redundancy provided, each service will go on working. For instance, a group might decide to duplicate its file servers but get by with one printer per floor. System-wide policy might dictate a higher level of replication for the underlying communication network. Mail does not have to fail between Palo Alto and Cambridge just because some machine goes down in Lafayette, Indiana.

3.2 Services

The standard services defined by BOB include the following fundamental facilities:

o Names -- provides access to a replicated, distributed database of global names and associated values for machines, users, files, distribution lists, access control groups, and services. A name service is the key BOB component for providing global names, although most of the work involved in implementing global names is making all the other components of the distributed system, e.g. existing operating systems, use the name service in a consistent way.

o Remote procedure call -- provides a standard way to define and securely invoke service interfaces. This allows service instances to be local or remote. The RPC mechanism can be organized to operate by dynamically choosing one of a variety of transport protocols. Choosing RPC as the standard service invocation mechanism does not force blocking call semantics on all programs. RPC defines a way to match response messages with request messages; it does not require that the calling program block to await a response. (A sketch of this request/response matching appears after this list.) Methods for dealing with asynchrony inside a single program are a local option. Blocking on RPC calls is a good choice when the local environment provides multiple threads per address space.

o User registration -- allows users to be registered and authenticated, and issues certificates permitting access to system resources and information.

o Time -- distributes consistent and accurate time globally.

o Files -- provides access to a replicated, distributed, global file system. Each component machine of BOB can make available the files it stores locally through this standard interface. The multiple file name spaces are connected by the name service. The file service specification should include standard presentations for the different VMS, Unix, etc. file types. For example, all implementations should support a standard view of any file as an array of bytes.

o Management -- provides access to the management data and operations of each component.
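To illustrate the non-blocking call semantics mentioned under the RPC service, here is a minimal Python sketch. It is illustrative only; the transport object and message format are assumptions, not part of any BOB specification. Each request carries an identifier, and responses are matched back to the pending call, so the caller decides whether to block:

    import itertools
    from concurrent.futures import Future

    class RpcChannel:
        # Matches response messages to request messages by identifier, so a
        # caller may block on the returned Future or collect it later.
        def __init__(self, transport):
            self.transport = transport       # assumed to provide send(msg)
            self.ids = itertools.count(1)
            self.pending = {}                # request id -> Future

        def call(self, operation, args):
            rid = next(self.ids)
            fut = Future()
            self.pending[rid] = fut
            self.transport.send({"id": rid, "op": operation, "args": args})
            return fut                       # caller decides whether to block

        def on_response(self, msg):
            fut = self.pending.pop(msg["id"], None)
            if fut is not None:
                fut.set_result(msg["result"])

    class LoopbackTransport:
        # Hypothetical stand-in transport that "delivers" a reply at once.
        def __init__(self):
            self.channel = None
        def send(self, msg):
            self.channel.on_response({"id": msg["id"], "result": "ok"})

    transport = LoopbackTransport()
    channel = RpcChannel(transport)
    transport.channel = channel
    reply = channel.call("read", {"name": "/com/dec/src/bwl"})
    print(reply.result())                    # blocking here was our choice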
In addition to these base-level facilities, BOB can provide other services appropriate to the intended applications, such as:

o Records -- provides access to records, either sequentially or via indexes, with record locking to allow concurrent reading and writing, and journaling to preserve integrity after a failure.

o Printers -- allows printing throughout the network of documents in standard formats such as PostScript and ANSI, including job control and scheduling.

o Execution -- allows programs to be run on any machine (or set of machines) in the network, subject to access and resource controls, and efficiently schedules both interactive and batch jobs on available machines, taking account of priorities, quotas, deadlines, and failures. The exact configuration and utilization of cycle servers (as well as idle workstations that can be used for computing) fluctuates constantly, so users and applications need automatic help in picking the machines on which to run.

o Mailboxes -- provides a transport service for electronic mail.

o Terminals -- provides access to a windowing graphics terminal from a computation anywhere in the network.

o Accounting -- provides access to a system-wide collection of data on resource usage which can be used for billing and monitoring.

In many cases adequate, widely accepted standards already exist for the definition of the base and additional services. Each service must be defined and implemented to provide the five pervasive properties.

3.3 Interfaces

Interfaces are the key to making BOB a coherent and open system. Each of the services is defined by an interface specification that serves as a contract between the service and its clients. The interface defines the operations to be provided, the parameters of each, and the detailed semantics of each relative to a model of the state maintained by the service. The specification is normally represented as an RPC interface definition; a small illustrative sketch appears at the end of this section. Some characterization of the performance of the operations must also be provided, although it is not well understood how to give precise performance characterizations for operations whose performance varies widely depending on the parameters and/or use history.

A precisely defined interface enables interworking across models, versions, and vendors. Several implementations of each interface can exist, and this variety allows the system to be heterogeneous in its components. In its interfaces, however, the system is homogeneous. It is this homogeneity that makes it a system with predictable behavior rather than a collection of components that merely can communicate. If more than one interface exists for the same function, it is unavoidable that the function will work differently through the different interfaces. The system will consequently be more complicated and less reliable. Perhaps some components will not be able to use others at all because they have no interface in common. Certainly customers and salesmen will find it much more difficult to configure workable collections of components, and programmers will not know what services they can depend upon being present.
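As a concrete, purely illustrative rendering of an interface-as-contract, the following Python sketch defines one hypothetical service interface. The operation names and the byte-array state model are assumptions consistent with the Files service described above; any implementation, local or remote and from any vendor, would have to honor the same contract:

    from abc import ABC, abstractmethod

    class FileService(ABC):
        # Hypothetical contract for the Files service. The state model is
        # a mapping from global names to arrays of bytes; the semantics of
        # each operation are defined against that model.

        @abstractmethod
        def read(self, name: str, offset: int, count: int) -> bytes:
            """Return up to count bytes of the named file, starting at offset."""

        @abstractmethod
        def write(self, name: str, offset: int, data: bytes) -> None:
            """Replace bytes starting at offset, extending the file if needed."""

Several classes could implement this interface (an in-memory server, a remote stub reached by RPC), and a client written against the contract cannot tell them apart; that interchangeability is what the homogeneity of interfaces buys.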
4. Achieving the Global Properties

Experience and research have suggested a set of models for achieving global naming, access, security, management, and availability. For each of these pervasive properties, we will consider the general approach that seems most promising.

4.1 Naming Model

Every user and every client program sees the entire system as the same tree of named objects. A global name is interpreted by following the named branches in this tree starting from the global root. Every node has a way to find a copy of the root of the global name tree. A hierarchic name space is used because it is the only naming structure we know of that scales well, allows autonomy in the selection of names, and is sufficiently malleable to allow for a long lifetime. A global root for the name space is required to provide each object with a single name that will work from everywhere. Non-hierarchic links can be added where a richer naming structure is required.

For each object type there is some service, whose interface is defined by BOB, that provides operations to create and delete objects of that type and to read and change their state. The top part of the naming tree, including the objects near the root, is implemented by the BOB name service. A node in the naming tree, however, can be a junction between the name service and some other service, e.g. a file service. A junction object contains:

o a set of servers for the named object
o rules for choosing a server
o the service interface ID, e.g. "BOB File Service 2.3"
o an object parameter, e.g. a volume identifier

To look up a name through a junction, choose a server and call the service interface there with the unresolved remainder of the name and the object parameter. The server looks up the rest of the name.

The servers listed in a junction object are designated by global names. To call a service at a server, the client must convert the server name to something more useful, like the network address of the server machine and information on which protocols to use in making the call. Looking up a server name in the global name tree produces a server object that contains:

o a machine name
o a set of communication protocol names

A final name lookup maps the (global) machine name into the network address that will be the destination for the actual RPC to the service. The junction machinery can be used at several levels, as appropriate. The junction is a powerful technique for unifying multiple implementations of naming mechanisms within the same hierarchic name space.

The following figure gives an example of what the top parts of the global name space might look like, based on the naming scheme of the Internet Domain Name Service. An X.500 service could also provide this part of the name space (or both the Internet DNS and X.500 could be accommodated). An important part of implementing BOB is defining the actual name hierarchy to be used in the top levels of the global name space.

/com/dec/src/domain ... data for SRC management domain
            /adb : Principal=(Password=..., Mailbox=..., etc)
            /bwl : Principal=(Password=..., Mailbox=..., etc)
            /mds : Principal=(Password=..., Mailbox=..., etc)
            / ... other principals registered at SRC
            /staff : Group=(/com/dec/src/adb, /com/dec/src/bwl, etc)
            / ... other groups at SRC
            /computers/C1 : Computer=(Address=16.1.0.0, HWInfo=...,
                                      Bootfile=..., Bootdata=..., etc)
                      /C2 : Computer=(Address=16.1.0.1, etc)
                      / ... other computers
            /backmap/16.1.0.0 : Path=/com/dec/src/computers/C1
                    /16.1.0.1 : Path=/com/dec/src/computers/C2
                    / ... IP addresses of other computers
            /bin : Volume=(junction to file service)
            /udir : Volume=(junction to file service)
            / ... other volumes
            / ... other SRC objects
        / ... other parts of DEC
    / ... other commercial organizations
/ ... other sectors and countries
Consider some of the objects named here. /com, /com/dec, and /com/dec/src are directories implemented by the BOB name service. /com/dec/src/adb is a registered user, also an object implemented by the name service. The object /com/dec/src/adb contains a suitably encrypted password, a set of mailbox sites, and other information that is associated with this system user. /com/dec/src/staff is a group of global names. Group objects are provided by BOB's name service to implement things like distribution lists, access control lists, and sets of servers. /com/dec/src/bin is a file system volume. Note that this object is a junction to the BOB file service. The figure does not show the content of this junction object, but it contains a group naming the set of servers implementing this file volume and rules for choosing which one to use, e.g. the first that responds.

To look up the name /com/dec/src/bin/ls, for example, the operating system on a client machine traverses the path /com/dec/src/bin using the name service. The result at that point is the content of the junction object, which then allows the client to contact a suitable file server to complete the lookup.

The local management autonomy provided by hierarchic names allows system implementors and administrators to build and use their systems without waiting for a planet-wide agreement to be reached about the structure of the first few levels of the global hierarchy. A local hierarchic name space can be constructed that is sufficient to the local need, treating the local root as a global root. Later, this local name space can be incorporated as a subtree in a larger name space. Using a variable to provide the name of the operational root (set to NIL at first) will ease the transition to the larger name space (and shorten the names that people actually use). Another technique is to initially name the local root with an identifier that is unlikely to appear in any future global root; then symbolic links in the global root, or special-case code in the local name resolution machinery, can ease the transition.
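The junction-crossing lookup just described can be made concrete with a small Python sketch. This is an illustration only: the in-memory tree stands in for the name service database, and the record fields follow the junction object described above.

    from dataclasses import dataclass

    @dataclass
    class Junction:
        servers: list        # global names of the servers for this object
        interface_id: str    # e.g. "BOB File Service 2.3"
        object_param: str    # e.g. a volume identifier

    def resolve(tree, path):
        # Walk the name tree; on reaching a junction, return it together
        # with the unresolved remainder for a file server to finish.
        node = tree
        parts = [p for p in path.split("/") if p]
        for i, part in enumerate(parts):
            node = node[part]
            if isinstance(node, Junction):
                return node, "/".join(parts[i + 1:])
        return node, ""

    # A tiny in-memory stand-in for the name service database.
    tree = {"com": {"dec": {"src": {"bin": Junction(
        servers=["/com/dec/src/computers/C1"],
        interface_id="BOB File Service 2.3",
        object_param="volume-bin")}}}}

    junction, rest = resolve(tree, "/com/dec/src/bin/ls")
    print(junction.interface_id, rest)   # -> BOB File Service 2.3 ls
    # A real clerk would now choose a server from junction.servers, map its
    # machine name to a network address with two more lookups, and make an
    # RPC carrying the remaining path plus junction.object_param.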
4.2 Access Model

Global access means that a program can run anywhere in BOB (on a computer and operating system compatible with the program binary) and get the same result, although the performance may vary depending on the machine chosen. Thus, a program can be executed either on the user's workstation or on a fast cycle server in the machine room, while still accessing the same user files through the same names. Achieving global access requires allowing all elements of the computing environment of a program to be remote from the computer where the program is executing. All services and objects required for a program to run need to be available to a program executing anywhere in the distributed system. For a particular user, "anywhere" includes at least:

o on the user's own workstation;
o on public workstations or compute servers in the user's domain;
o on public workstations in another domain on the user's LAN;
o on public workstations across a low-bandwidth WAN.

Performance in the first three cases should be similar. Performance in the fourth case is fundamentally limited by the WAN characteristics, although the use of caching can make the difference small in many cases.

In BOB, global naming and standard services exported via a uniform RPC mechanism provide the keys to achieving global access. All BOB services accept global names for the objects on which they operate. All BOB services are available to remote clients. Thus, any object whose global name is known can be accessed remotely. In addition, programs must access their environments only by using the global names of objects. This last step will require a thorough examination of the computing environment provided by each existing operating system to identify all the ways in which programs can access their environment. For each way identified, a mechanism must be designed to provide the global name of the desired object. For example, in Unix systems that operate as part of BOB, the identities of the file system root directory, working directory, and the /tmp directory of a process must be specified by global names. Altering VMS, Unix, and other operating systems to accept global names everywhere will be a major undertaking.

Another aspect of global access is making sure that BOB services have operation semantics that are location transparent. Without location transparency in the services it uses, a program will not get the same result when it runs on different computers. A service that allows read/write sharing of object state must provide a data coherence model. The model allows client programs to maintain correct object state and to behave in a manner that does not surprise users, no matter where clients and servers execute. Depending on the nature of the service, it is possible to trade off performance, availability, scale, and coherence.

In the name service, for example, it is appropriate to increase performance, availability, and scalability at the expense of coherence. A client update to the name service database can be made by contacting any server. After the client operation has completed, the server propagates the update to object replicas at other servers. Until propagation completes, different clients can read different values for the object. This lack of coherence produces several advantages: it increases performance by limiting the client's wait for update completion; it increases availability by allowing a client to perform an update when just one server is accessible; and it increases scale by separating propagation latency from the visible latency of the client update operation. For the objects that the name service will store, this lack of coherence is deemed acceptable given the benefits it produces. The data coherence model for the name service defines the loose coherence invariants that programmers can depend upon, thereby meeting the requirement of a coherence model that is insensitive to client and server location.
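A minimal Python sketch of this update-anywhere, propagate-later scheme follows. It is illustrative only: the timestamped last-writer-wins rule is an assumption chosen to keep the replicas convergent, not a mechanism specified by the chapter.

    import threading, time

    class NameServer:
        # One replica of the name service database.
        def __init__(self):
            self.store = {}   # name -> (value, timestamp)
            self.peers = []

        def update(self, name, value):
            # Apply the update locally and acknowledge immediately; the
            # client's visible latency excludes propagation to peers.
            stamped = (value, time.time())
            self.store[name] = stamped
            threading.Thread(target=self._propagate,
                             args=(name, stamped)).start()
            return "ok"

        def _propagate(self, name, stamped):
            for peer in self.peers:
                peer.receive(name, stamped)

        def receive(self, name, stamped):
            # Last-writer-wins keeps replicas convergent even if updates
            # arrive out of order; until propagation completes, reads at
            # different replicas may return different values.
            if name not in self.store or self.store[name][1] < stamped[1]:
                self.store[name] = stamped

    a, b = NameServer(), NameServer()
    a.peers, b.peers = [b], [a]
    a.update("/com/dec/src/mds", "Mailbox=...")   # contact any one server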
On the other hand, the BOB file service needs to provide consistent write sharing, even at some cost in performance, scale, and availability. Many programs and users are accustomed to using the file system as a communication channel between programs. For example, a programmer may save a source file for a module from an editor and then trigger a recompilation and relinking on a remote cycle server. He will be annoyed if the program is rebuilt with an old version of the module because the cycle server retrieved an old, cached version of the file. File read/write coherence is also important among elements of distributed computations running, say, on multiple computers on the same LAN. The file system coherence model must cover file names and attributes as well as file data.

There is diversity of opinion among researchers about the best consistency model for a general-purpose distributed file system. Some feel that an open/close consistency model provides the best tradeoff: with this model, changes made by one client are not propagated until that client closes the file and others (re)open it. Others feel that byte-level write sharing is feasible and desirable: with this model, clients share the file as though all were accessing it in the same local memory. Successful systems have been built using variants of both models. BOB can be successful with either model.

4.3 Security Model

Security is based on three notions:

o Authentication -- for every request to do an operation, the name of the user or computer system making the request is known reliably. The source of a request is called a "principal".

o Access control -- for every resource (computer, printer, file, database, etc.) and every operation on that resource (read, write, delete, etc.), it is possible to specify the names of the principals allowed to do that operation on that resource. Every request for an operation is checked to ensure that its principal is allowed to do that operation.

o Auditing -- every access to a resource can be logged if desired, as can the evidence used to authenticate every request. If trouble comes up, there is a record of exactly what happened.

To authenticate a request as coming from a particular principal, the system must determine that the principal originated the request and that the request was not modified on the way to its destination. We do the latter by establishing a "secure channel" between the system that originates the request and the one that carries it out. Practical security in a distributed system requires encryption to secure the communication channels. The encryption must not slow down communication, since in general it is too hard to be sure that a particular message does not need to be encrypted. So the security architecture should include methods of doing encryption and decryption on the fly, as data flows from computers into the network and back.

To determine who originated a request, it is necessary to know who is on the other end of the secure channel. Usually this is done by having the principal at the other end demonstrate that it knows some secret (such as a password), and then finding out in a reliable way the name of the principal that knows that secret. BOB's security architecture needs to specify how to do both these things. It is best if a principal can show that it knows the secret without giving it away, since otherwise the system could later impersonate the principal. Password-based schemes reveal the secret, but schemes based on encryption do not. It is desirable to authenticate a user by his possession of a device which knows his secret and can demonstrate this by encryption. Such a device is called a "smart card"; a small sketch of such an exchange appears below. An inferior alternative is for the user to type his password to a trusted agent. To authenticate a computer system, we need to be sure that it has been properly loaded with a good operating system image; BOB must specify methods to ensure this.
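The point that a principal should prove knowledge of a secret without revealing it can be illustrated with a challenge/response exchange. The Python sketch below uses an HMAC from the standard library purely as a modern stand-in for the "demonstrate by encryption" idea; the shared-secret arrangement between card and verifier is an assumption, not the chapter's design:

    import hmac, hashlib, os

    def prove(secret: bytes, challenge: bytes) -> bytes:
        # Respond to a fresh challenge without ever transmitting the secret.
        return hmac.new(secret, challenge, hashlib.sha256).digest()

    secret = b"known only to the smart card and the authentication service"
    challenge = os.urandom(16)            # fresh nonce defeats replay
    response = prove(secret, challenge)   # computed inside the card
    # The verifier recomputes the expected response from its own copy of
    # the secret; the secret itself never crosses the network.
    assert hmac.compare_digest(response, prove(secret, challenge))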
Security depends on naming, since access control identifies the principals that are allowed access by name. Practical security also depends on being able to have groups of principals, e.g. the Executive Committee, or the system administrators for the cluster named "Star". Both these facilities must be provided by the name service. To ensure that the names and groups are defined reliably, digital signatures are used to certify information in the name service; the signatures are generated by a special "certification authority" which is engineered for high reliability and kept off-line, perhaps in a safe, when its services are not needed. Authentication depends only on the smallest sub-tree of the full naming tree that includes both the requesting principal and the resource; certification authorities that are more remote are assumed to be less trusted.

Security also depends on time. Authentication, access control, and secure channels require correct timestamps and clocks. The time service must distribute time securely.

4.4 Management Model

System management is the adjustment of system state by a human manager. Management is needed when satisfactory algorithmic adjustments cannot be provided -- when human judgement is required. The problem in a large-scale distributed system is to provide each system manager with the means to monitor and adjust a fairly large collection of different types of geographically distributed components. Of the five pervasive properties of BOB, global management is the one we understand least well how to achieve. The facilities, however, are likely to be structured along the following lines.

The BOB management model is based on the concept of domains. Every component in a distributed system is assigned to a domain. (A component is a piece of equipment or a piece of management-relevant object state.) Each domain has a responsible system manager. In the simple version of the model (which is probably adequate for most, perhaps all, systems) domains are disjoint and managers are disjoint, although more complex arrangements are possible, e.g. overlapping domains or a hierarchy of managers. Ideally a domain would not depend on any other domains for its correct operation. Quite a bit of flexibility is needed for defining domains, as different arrangements will be effective in different installations. Example domains include:

o components used by a group of people with common goals;
o components that a group of users expects to find working;
o the largest pile of components under one system manager;
o an arbitrary pile of components that is not too big.

As a practical matter, customers will require guidelines for defining effective management domains.

BOB requires that each component define and export a management interface, using RPC if possible. Each component is managed via calls to this interface from interactive tools run by human managers. Some requirements for the management interface of a component are:

o Remote access -- The management interface provides remote access to all management functions. Local error logs are maintained that can be read from the management interface. A secure channel is provided from management tools to the interface, and the operations are subject to authentication and access control. No running around by the manager is required.

o Program interface -- Each component's management interface is designed to be driven by a program, not a person. Actual invocation of management functions is by RPC calls from management tools. This allows a manager to do a lot with a little typing. A good management interface provides end-to-end checks to verify successful completion of a series of complex actions, and provides operations that are independent of initial component state, to make it easier to achieve the desired final state.

o Relevance -- The management interface operates only on management-relevant state. In places where the flexibility is useful rather than just confusing, the management interface permits decentralized management by individual users.

o Uniformity -- Different kinds of components should strive for uniformity in their management interfaces. This allows a single manager to retain intellectual control of a larger number of kinds of components.

The management interfaces and tools make it practical for one person to manage large domains. An interactive management tool can invoke the management interfaces of all components in a domain. It provides suitable ways to display and correlate the data, and to change the management-relevant state of components. Management tools are capable of making the same state change in a large set of similar components in a domain via iterative calls, as the sketch below illustrates. To provide the flexibility to invent new management operations, some management tools support the construction of programs that call the management interfaces of domain components.
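A minimal Python sketch of such an iterative management action follows. Everything here is hypothetical: the component names, the set/get operations, and the in-memory stand-in for the management RPC interface are assumptions used to show the shape of the tool, including an end-to-end read-back check:

    def set_everywhere(components, parameter, value, call):
        # Apply one state change across a whole domain via iterative calls,
        # then read the state back as an end-to-end check of completion.
        failures = []
        for component in components:
            try:
                call(component, "set", parameter, value)
                if call(component, "get", parameter) != value:
                    failures.append(component)
            except TimeoutError:
                failures.append(component)   # unreachable or not updated
        return failures

    # Hypothetical in-memory stand-in for the management RPC interface.
    state = {"ws1": {}, "ws2": {}, "ws3": {}}
    def call(component, op, parameter, value=None):
        if op == "set":
            state[component][parameter] = value
        return state[component].get(parameter)

    print(set_everywhere(state, "bootfile", "v2.bin", call))   # -> []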
4.5 Availability Model

To achieve high availability of a service there must be multiple servers for that service. If these servers are structured to fail independently, then any desired degree of availability can be achieved by adjusting the degree of replication. The most practical scheme for replicating the services in BOB is primary/backup, in which a client uses one server at a time and servers are arranged (as far as possible) to fail in a benign way, say by stopping. The alternative method, called active replication, has the client perform each operation at several servers. Active replication uses more resources than primary/backup, but it has no failover delays and can tolerate arbitrary failure behavior by servers.

To see how primary/backup works, recall from the naming model discussion that the object representing a service includes a set of servers and some rules for choosing one. If the chosen server (the primary) fails, then the client can fail over to another server (the backup) and repeat the operation. The client assumes failure if a response to an operation does not occur within a timeout period. The timeout should be as short as possible, so that the latency of an operation that fails over is comparable to the usual latency. In practice, setting good timeouts is hard and fixed timeouts may not be adequate. If clients can do operations that change long-term state, then the primary server must keep the backup servers up to date.

To achieve transparent failover from the point of view of client programs, knowledge of the multiple servers should be encapsulated in an agent on the client computer. In this chapter we refer to the agent as a clerk. The clerk can export a logically centralized, logically local service to the client program, even when the underlying service implementation is distributed, replicated, and remote. The clerk software can have many different structural relationships to its client. In simple cases it can be runtime libraries loaded into the client address space and invoked with local procedure calls.
Or it may operate in a separate address space on the same machine as the client and be invoked by same-machine IPC, same-machine RPC, or callbacks from the operating system. Or it can be in the operating system itself. The clerk interface need not be the same as the server interface. Indeed, the server interface usually will be significantly more complex. In addition to implementing server selection and failover, a clerk may provide caching and write-behind to improve performance, and can implement aggregate operations that read-modify-write server state. As simple examples of caching, a name service clerk might remember the results of recently looked up names and maintain open connections to frequently used name servers, or a file service clerk might cache the results of recent file directory and file data reads. Write-behind allows a clerk to batch several updates as a single server operation which can be done asynchronously, thus reducing the latency of operations at the clerk interface. Implementing a client's read-modify-write operation might require the clerk to use complex retry strategies involving several server operations when a failover occurs at some intermediate step.

As an example of how a clerk masks the existence of multiple servers, consider the actions involved in listing the contents of /com/dec/src/udir/bwl/Mail/inbox, a BOB file system directory. The client program presents the entire path name to the file service clerk. The clerk locates a name server that stores the root directory and presents the complete name. That server may store the directories down to, say, /com/dec. The directory entry for /com/dec/src will indicate another set of servers. So the first lookup operation will return the new server set and the remaining unresolved path name. The clerk will then contact a server in the new set and present the unresolved path src/udir/bwl/Mail/inbox. This server discovers that src/udir is a junction to a file system volume, so it returns the junction information and the unresolved path name udir/bwl/Mail/inbox. Finally, the clerk uses the junction information to contact a file server, which in this example actually stores the target directory and responds with the directory contents. What looks like a single operation to the client program actually involves RPCs by the clerk to three different servers. This name resolution would complete with fewer or no operations at remote servers if the clerk had a cache that already contained the necessary information. In practice, caches are part of most clerk implementations and most operations are substantially sped up by the presence of cached data.

Other issues arise when implementing a high-availability service with long-term state. The servers must cooperate to maintain consistent state among themselves, so that a backup can take over reasonably quickly. Problems to be solved include arranging that no writes get lost during failover from one server to another, and that a server that has been down can recover the current state when it is restarted. Combining these requirements with caching and write-behind to obtain good performance, without sacrificing consistent sharing, can make implementing a highly available service quite challenging.
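The skeleton of such a clerk, reduced to read operations, might look like the Python sketch below. It is a sketch under stated assumptions, not a prescribed implementation: servers are assumed to expose read(name, timeout) and to fail benignly by stopping, and the write path, cache invalidation, and the retry strategies discussed above are deliberately omitted:

    class Clerk:
        # Client-side agent that hides replication behind a local interface:
        # try the primary, fail over to backups on timeout, cache reads.
        def __init__(self, servers, timeout=0.5):
            self.servers = list(servers)   # ordered by the service's rules
            self.timeout = timeout
            self.cache = {}

        def read(self, name):
            if name in self.cache:         # often avoids any remote call
                return self.cache[name]
            last = None
            for server in self.servers:    # primary first, then backups
                try:
                    value = server.read(name, timeout=self.timeout)
                    self.cache[name] = value
                    return value
                except TimeoutError as exc:
                    last = exc             # assume benign, fail-stop servers
            raise last                     # all replicas exhausted

A client program calls clerk.read("/com/dec/src/bin/ls") and sees a logically centralized, logically local service; the iteration over replicas and the cache are invisible at the clerk interface, which is exactly the transparency the availability model calls for.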
5. Conclusion

This chapter covers the inherent strengths of centralized and networked computing systems. It outlines the structure and properties of BOB, a state-of-the-art distributed computing base for supporting general-purpose computing. This system combines the best features of centralized and networked systems with recent advances in security and availability to produce a powerful, cost-effective, easy-to-use computing environment.

Getting systems like BOB into widespread use will be hard. Given the state of the art in distributed systems technology, building a prototype for proof of concept is certainly feasible. But the only practical method for getting widespread use of systems with these properties is to figure out ways to approach the goal by making incremental improvements to existing networked systems. This requires producing a sequence of useful, palatable changes that lead all the way to the goal.

6. Acknowledgements

The material in this chapter was jointly developed by Michael Schroeder, Butler Lampson, and Andrew Birrell. The ideas explored here aggregate many years of experience of the designers, builders, and users of distributed systems. Colleagues at SRC and fellow authors of this book have provided useful suggestions on the presentation.