This paper appears as Chapter 1 in the book Distributed Systems, edited by Sape Mullender and published by Addison-Wesley and ACM Press in 1993, 1994, and 1995. Copyright 1993 by the ACM Press.

A State-of-the-Art Distributed System: Computing with BOB

Michael D. Schroeder
DEC Systems Research Center
130 Lytton Ave
Palo Alto, CA 94301
mds@pa.dec.com

Distributed systems are a popular and powerful computing paradigm. Yet existing examples have serious shortcomings as a base for general-purpose computing. This chapter explores the characteristics, strengths, and weaknesses of distributed systems. It describes a model for a state-of-the-art distributed system that can do a better job of supporting general-purpose computing than existing systems. This model system combines the best features of centralized and networked systems, with improved availability and security. The chapter concludes by outlining the technical approaches that appear most promising for structuring such a system.

This chapter does not specifically discuss application-specific distributed systems, such as automated banking systems, control systems for roaming and tracking cellular telephones, and retail point-of-sale systems, although there are many economically important examples. The issues raised in the chapter do apply to such systems, and the model system described here should be a suitable base for many of them. Some systems that are designed to support a narrow set of uses, however, may need to be structured in application-specific ways that can be simpler and more efficient for those uses. The extent to which special-purpose systems should be built on a general-purpose distributed base needs to be investigated further. This chapter also does not address real-time distributed systems, such as control systems for factories, aircraft, or automobiles, which face distinctive scheduling and resource utilization requirements.

1. Characteristics of Distributed Systems

A distributed system is several computers doing something together. Thus, a distributed system has three primary characteristics.

o Multiple Computers -- A distributed system contains more than one physical computer, each consisting of CPUs, some local memory, possibly some stable storage like disks, and I/O paths to connect it with the environment.

o Interconnections -- Some of the I/O paths will interconnect the computers. If they cannot talk to each other, then it is not going to be a very interesting distributed system.

o Shared State -- The computers cooperate to maintain some shared state. Put another way, if the correct operation of the system is described in terms of some global invariants, then maintaining those invariants requires the correct and coordinated operation of multiple computers.

Building a system out of interconnected computers requires that three major issues be addressed.

o Independent Failure -- Because there are several distinct computers involved, when one breaks others may keep going. Often we want the "system" to keep working after one or more have failed.

o Unreliable Communication -- Because in most cases the interconnections among computers are not confined to a carefully controlled environment, they will not work correctly all the time. Connections may be unavailable. Messages may be lost or garbled. One computer cannot count on being able to communicate clearly with another, even if both are working. (See the sketch after this list.)

o Insecure Communication -- The interconnections among the computers may be exposed to unauthorized eavesdropping and message modification.
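To make the unreliable-communication issue concrete, the following minimal Python sketch (not from the chapter; the lossy channel is simulated) shows the standard coping technique of retrying on timeout, and why a sender still cannot tell whether a timed-out request was actually executed:

    import random

    def unreliable_send(request):
        # Stand-in for a real network: the request or its reply may be lost.
        if random.random() < 0.3:
            raise TimeoutError("no reply within deadline")
        return "ack:" + request

    def call_with_retry(request, attempts=4):
        # Retry on timeout. A timeout is ambiguous: the receiver may have
        # executed the request and only the reply was lost, so requests
        # should be idempotent or deduplicated by a request identifier.
        for _ in range(attempts):
            try:
                return unreliable_send(request)
            except TimeoutError:
                continue
        # Retry alone cannot guarantee delivery; this can still happen.
        raise TimeoutError("gave up after %d attempts" % attempts)

    print(call_with_retry("write x=1"))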
A centralized system that supports multiple processes and provides some form of interprocess communication, e.g. a Unix timesharing system, can exhibit in virtual form the three primary characteristics of a distributed system. A centralized system may even manifest independent failure, e.g. the daemon process for mail transport may crash without stopping interactive user processes on the same system. Thus, design and programming techniques associated with communicating sequential processes in centralized systems form part of the basic techniques in the distributed systems arena. However, centralized systems are usually successful without dealing with independent failure, and they never need to confront unreliable and insecure communication. In a distributed system all three issues must be addressed. The canonical example of a general-purpose distributed system today is a networked system -- a set of workstations/PCs and servers interconnected with a network.

2. Networked vs Centralized Systems

It is easy to understand why networked systems are popular. Such systems allow the sharing of information and resources over a wide geographic and organizational spread. They allow the use of small, cost-effective computers and get the computing cycles close to the data. They can grow in small increments over a large range of sizes. They allow a great deal of autonomy through separate component purchasing decisions, selection of multiple vendors, use of multiple software versions, and adoption of multiple management policies. Finally, they do not necessarily all crash at once. Thus, in the areas of sharing, cost, growth, and autonomy, networked systems are better than traditional centralized systems as exemplified, say, by timesharing.

On the other hand, centralized systems do some things better than today's networked systems. All information and resources in a centralized system are equally accessible. Functions work the same way and objects are named the same way everywhere in a centralized system. And a centralized system is easier to manage. So despite the advantages of networked systems, centralized systems are often easier to use because they are more accessible, coherent, and manageable.

In the areas of security and availability, the comparison between networked systems and centralized systems produces no clear-cut advantage for either.

2.1 Security

In usual practice to date, neither centralized nor networked systems offer real security, but for different reasons. A centralized system has a single security domain under the control of a single authority. The trusted computing base is contained in a single operating system. That operating system executes on a single computer that can have a physically secure environment. With all the security eggs in one basket, so to speak, users understand the level of trust to assign to the system and whom to speak to when problems arise. On the other hand, it is notoriously difficult to eliminate all the security flaws from an operating system or from the operating environment, and with a single security domain one such flaw can be exploited to break the security of the entire system.

Networked systems have multiple security domains and thus exhibit the inverse set of security properties.
The trusted computing base is scattered among many components that operate in environments with varying degrees of physical security, under differing security policies, and possibly under different authorities. The interconnections among the computers are physically insecure. It is hard to know what is being trusted and what can be trusted. But, because the system contains many computers, exploiting a security flaw in the software or environment of one computer does not automatically compromise the entire system.

2.2 Availability

A similar two-sided analysis applies to availability. A centralized system can have a controlled physical and operational environment. Since a high proportion of system failures are the result of operational and environmental factors, careful management of this single environment can produce good availability. But when something does go wrong, the whole system goes down at once, stopping all users from getting work done.

In a networked system the various computers fail independently. However, it is often the case that several computers must be in operation simultaneously before a user can get work done, so the probability of the system failing is greater than the probability of any one component failing. This increased probability of not working, compared to a centralized system, is the result of ignoring independent failure. The consequence is Leslie Lamport's definition of a distributed system: "You know you have one when the crash of a computer you've never heard of stops you from getting any work done."

On the other hand, independent failure in a distributed system can be exploited to increase availability and reliability. When independent failure is properly harnessed by replicating functions on independent components, multiple component failures are required before system availability and reliability suffer. The probability of the system failing thus can be less than the probability in a centralized system. Dealing with independent failure to avoid making availability worse, or even to make it better, is a major task for designers of distributed systems.

A distributed system also must cope with communication failures. Unreliable communication not only contributes to unavailability, it can lead to incorrect functioning. A computer cannot reliably distinguish a down neighbor from a disconnected neighbor, and therefore can never be sure an unresponsive neighbor has actually stopped. Maintaining global state invariants in such circumstances is tricky. Careful design is required to actually achieve correct operation and high availability using replicated components.

2.3 A State-of-the-Art Distributed System

It seems feasible to develop a distributed system that combines the accessibility, coherence, and manageability advantages of centralized systems with the sharing, growth, cost, and autonomy advantages of networked systems. If real security and high availability were added to the mix, then we would have a state-of-the-art computing base for many purposes. Achieving this combination of features in a single system is the central challenge of supporting general-purpose computing well with a distributed system. No existing system fulfills this ideal.

3. The Properties and Services Model

We can describe this best-of-both-worlds (BOB) distributed computing base in terms of a model of its properties and services. This more technical description of the goals provides a structure that will help us to understand the mechanisms needed to achieve them.
The properties and services model defines BOB as:

o a heterogeneous set of hardware, software, and data components;
o whose size and geographic extent can vary over a large range;
o connected by a network;
o providing a uniform set of services (naming, remote invocation, user registration, time, files, etc.);
o with certain global properties (names, access, security, management, and availability).

Because we are talking about a base for general-purpose computing, the model is defined in the terms most useful to the programmers who develop components that are part of the base and who develop the many different applications that are to be built on top of it. But the fundamental coherence provided by the model will show through such components and applications (when they are correctly implemented) to provide a coherent system as viewed by its human users too.

The coherence that makes BOB a system rather than a collection of computers is a result of its uniform services and global properties. The services are available in the same way to every part of the system, and the properties allow every part of the system to be viewed in the same way, regardless of system size. Designers are well aware of the care that must be taken to produce implementations that can support growth to very large sizes (tens or hundreds of thousands of nodes). A similar challenge exists in making such expandable systems light-weight and simple enough to be suitable for very small configurations too.

BOB's coherence does not mean that all the components of the system must be the same. The model applies to a heterogeneous collection of computers running operating systems such as Unix, VMS, MS-DOS, Windows NT, and others. In short, all platforms can operate in this framework, even computers and systems from multiple vendors. The underlying network can be a collection of local area network segments, bridges, routers, gateways, and various types of long distance services, with connectivity provided by various transport protocols.

3.1 Properties

What do BOB's pervasive properties mean in more detail?

o Global names -- the same names work everywhere. Machines, users, files, distribution lists, access control groups, and services have full names that mean the same thing regardless of where in the system the names are used. For instance, Butler Lampson's user name might be something like /com/dec/src/bwl throughout the system. He will operate under that name when using any computer. Global naming underlies the ability to share things.

o Global access -- the same functions are usable everywhere with reasonable performance. If Butler sits down at a machine when visiting in California, he can do everything there that he can do when in his usual office in Massachusetts, with perhaps some performance degradation. For instance, from Palo Alto Butler could command the local printing facilities to print a file stored on his computer in Cambridge. Global access also includes the idea of data coherence. Suppose Butler is in Cambridge on the phone to Mike Schroeder in Palo Alto, and Butler makes a change to a file and writes it. Mike should be able to read the new version as soon as Butler thinks he has saved it. Neither Mike nor Butler should have to take any special action to make this possible.

o Global security -- the same user authentication and access control work everywhere.
For instance, Butler can authenticate himself to any computer in the system; he can arrange for data transfer secure from eavesdropping and modification between any two computers; and, assuming that the access control policy permits it, Butler can use exactly the same mechanism to let the person next door and someone from another site read his files. All the facilities that require controlled access (logins, files, printers, management functions, etc.) use the same machinery to provide access control.

o Global management -- the same person can manage components anywhere. Obviously one person will not manage all of a large system. But the system should not impose a priori constraints on which set of components a single person can manage. All of the components of the system provide a common interface to management tools. The tools allow a manager to perform the same action on large numbers of components at once. For instance, a single system manager can configure all the workstations in an organization without leaving his office.

o Global availability -- the same services work even after some failures. System managers get to decide (and pay for) the level of replication for each service. As long as the failures do not exceed the redundancy provided, each service will go on working. For instance, a group might decide to duplicate its file servers but get by with one printer per floor. System-wide policy might dictate a higher level of replication for the underlying communication network. Mail does not have to fail between Palo Alto and Cambridge just because some machine goes down in Lafayette, Indiana.

3.2 Services

The standard services defined by BOB include the following fundamental facilities:

o Names -- provides access to a replicated, distributed database of global names and associated values for machines, users, files, distribution lists, access control groups, and services. A name service is the key BOB component for providing global names, although most of the work involved in implementing global names is making all the other components of the distributed system, e.g. existing operating systems, use the name service in a consistent way.

o Remote procedure call -- provides a standard way to define and securely invoke service interfaces. This allows service instances to be local or remote. The RPC mechanism can be organized to operate by dynamically choosing one of a variety of transport protocols. Choosing RPC as the standard service invocation mechanism does not force blocking call semantics on all programs. RPC defines a way to match response messages with request messages; it does not require that the calling program block to await a response. (A sketch of this request/response matching appears after this list.) Methods for dealing with asynchrony inside a single program are a local option. Blocking on RPC calls is a good choice when the local environment provides multiple threads per address space.

o User registration -- allows users to be registered and authenticated, and issues certificates permitting access to system resources and information.

o Time -- distributes consistent and accurate time globally.

o Files -- provides access to a replicated, distributed, global file system. Each component machine of BOB can make available the files it stores locally through this standard interface. The multiple file name spaces are connected by the name service. The file service specification should include standard presentations for the different VMS, Unix, etc. file types. For example, all implementations should support a standard view of any file as an array of bytes.

o Management -- provides access to the management data and operations of each component.
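To illustrate the non-blocking call semantics mentioned under the RPC service, here is a minimal Python sketch. It is illustrative only; the transport object and message format are assumptions, not part of any BOB specification. Each request carries an identifier, and responses are matched back to the pending call, so the caller decides whether to block:

    import itertools
    from concurrent.futures import Future

    class RpcChannel:
        # Matches response messages to request messages by identifier, so a
        # caller may block on the returned Future or collect it later.
        def __init__(self, transport):
            self.transport = transport       # assumed to provide send(msg)
            self.ids = itertools.count(1)
            self.pending = {}                # request id -> Future

        def call(self, operation, args):
            rid = next(self.ids)
            fut = Future()
            self.pending[rid] = fut
            self.transport.send({"id": rid, "op": operation, "args": args})
            return fut                       # caller decides whether to block

        def on_response(self, msg):
            fut = self.pending.pop(msg["id"], None)
            if fut is not None:
                fut.set_result(msg["result"])

    class LoopbackTransport:
        # Hypothetical stand-in transport that "delivers" a reply at once.
        def __init__(self):
            self.channel = None
        def send(self, msg):
            self.channel.on_response({"id": msg["id"], "result": "ok"})

    transport = LoopbackTransport()
    channel = RpcChannel(transport)
    transport.channel = channel
    reply = channel.call("read", {"name": "/com/dec/src/bwl"})
    print(reply.result())                    # blocking here was our choice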
In addition to these base-level facilities, BOB can provide other services appropriate to the intended applications, such as:

o Records -- provides access to records, either sequentially or via indexes, with record locking to allow concurrent reading and writing, and journaling to preserve integrity after a failure.

o Printers -- allows printing throughout the network of documents in standard formats such as PostScript and ANSI, including job control and scheduling.

o Execution -- allows programs to be run on any machine (or set of machines) in the network, subject to access and resource controls, and efficiently schedules both interactive and batch jobs on available machines, taking account of priorities, quotas, deadlines, and failures. The exact configuration and utilization of cycle servers (as well as idle workstations that can be used for computing) fluctuates constantly, so users and applications need automatic help in picking the machines on which to run.

o Mailboxes -- provides a transport service for electronic mail.

o Terminals -- provides access to a windowing graphics terminal from a computation anywhere in the network.

o Accounting -- provides access to a system-wide collection of data on resource usage which can be used for billing and monitoring.

In many cases adequate, widely accepted standards already exist for the definition of the base and additional services. Each service must be defined and implemented to provide the five pervasive properties.

3.3 Interfaces

Interfaces are the key to making BOB a coherent and open system. Each of the services is defined by an interface specification that serves as a contract between the service and its clients. The interface defines the operations to be provided, the parameters of each, and the detailed semantics of each relative to a model of the state maintained by the service. The specification is normally represented as an RPC interface definition; a small illustrative sketch appears at the end of this section. Some characterization of the performance of the operations must also be provided, although it is not well understood how to give precise performance characterizations for operations whose performance varies widely depending on the parameters and/or use history.

A precisely defined interface enables interworking across models, versions, and vendors. Several implementations of each interface can exist, and this variety allows the system to be heterogeneous in its components. In its interfaces, however, the system is homogeneous. It is this homogeneity that makes it a system with predictable behavior rather than a collection of components that merely can communicate. If more than one interface exists for the same function, it is unavoidable that the function will work differently through the different interfaces. The system will consequently be more complicated and less reliable. Perhaps some components will not be able to use others at all because they have no interface in common. Certainly customers and salesmen will find it much more difficult to configure workable collections of components, and programmers will not know what services they can depend upon being present.
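As a concrete, purely illustrative rendering of an interface-as-contract, the following Python sketch defines one hypothetical service interface. The operation names and the byte-array state model are assumptions consistent with the Files service described above; any implementation, local or remote and from any vendor, would have to honor the same contract:

    from abc import ABC, abstractmethod

    class FileService(ABC):
        # Hypothetical contract for the Files service. The state model is
        # a mapping from global names to arrays of bytes; the semantics of
        # each operation are defined against that model.

        @abstractmethod
        def read(self, name: str, offset: int, count: int) -> bytes:
            """Return up to count bytes of the named file, starting at offset."""

        @abstractmethod
        def write(self, name: str, offset: int, data: bytes) -> None:
            """Replace bytes starting at offset, extending the file if needed."""

Several classes could implement this interface (an in-memory server, a remote stub reached by RPC), and a client written against the contract cannot tell them apart; that interchangeability is what the homogeneity of interfaces buys.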
4. Achieving the Global Properties

Experience and research have suggested a set of models for achieving global naming, access, security, management, and availability. For each of these pervasive properties, we will consider the general approach that seems most promising.

4.1 Naming Model

Every user and every client program sees the entire system as the same tree of named objects. A global name is interpreted by following the named branches in this tree starting from the global root. Every node has a way to find a copy of the root of the global name tree. A hierarchic name space is used because it is the only naming structure we know of that scales well, allows autonomy in the selection of names, and is sufficiently malleable to allow for a long lifetime. A global root for the name space is required to provide each object with a single name that will work from everywhere. Non-hierarchic links can be added where a richer naming structure is required.

For each object type there is some service, whose interface is defined by BOB, that provides operations to create and delete objects of that type and to read and change their state. The top part of the naming tree, including the objects near the root, is implemented by the BOB name service. A node in the naming tree, however, can be a junction between the name service and some other service, e.g. a file service. A junction object contains:

o a set of servers for the named object
o rules for choosing a server
o the service interface ID, e.g. "BOB File Service 2.3"
o an object parameter, e.g. a volume identifier

To look up a name through a junction, choose a server and call the service interface there with the unresolved remainder of the name and the object parameter. The server looks up the rest of the name.

The servers listed in a junction object are designated by global names. To call a service at a server, the client must convert the server name to something more useful, like the network address of the server machine and information on which protocols to use in making the call. Looking up a server name in the global name tree produces a server object that contains:

o a machine name
o a set of communication protocol names

A final name lookup maps the (global) machine name into the network address that will be the destination for the actual RPC to the service. The junction machinery can be used at several levels, as appropriate. The junction is a powerful technique for unifying multiple implementations of naming mechanisms within the same hierarchic name space.

The following figure gives an example of what the top parts of the global name space might look like, based on the naming scheme of the Internet Domain Name Service. An X.500 service could also provide this part of the name space (or both the Internet DNS and X.500 could be accommodated). An important part of implementing BOB is defining the actual name hierarchy to be used in the top levels of the global name space.

/com/dec/src/domain ... data for SRC management domain
            /adb : Principal=(Password=..., Mailbox=..., etc)
            /bwl : Principal=(Password=..., Mailbox=..., etc)
            /mds : Principal=(Password=..., Mailbox=..., etc)
            / ... other principals registered at SRC
            /staff : Group=(/com/dec/src/adb, /com/dec/src/bwl, etc)
            / ... other groups at SRC
            /computers/C1 : Computer=(Address=16.1.0.0, HWInfo=...,
                                      Bootfile=..., Bootdata=..., etc)
                      /C2 : Computer=(Address=16.1.0.1, etc)
                      / ... other computers
            /backmap/16.1.0.0 : Path=/com/dec/src/computers/C1
                    /16.1.0.1 : Path=/com/dec/src/computers/C2
                    / ... IP addresses of other computers
            /bin : Volume=(junction to file service)
            /udir : Volume=(junction to file service)
            / ... other volumes
            / ... other SRC objects
        / ... other parts of DEC
    / ... other commercial organizations
/ ... other sectors and countries
Consider some of the objects named here. /com, /com/dec, and /com/dec/src are directories implemented by the BOB name service. /com/dec/src/adb is a registered user, also an object implemented by the name service. The object /com/dec/src/adb contains a suitably encrypted password, a set of mailbox sites, and other information that is associated with this system user. /com/dec/src/staff is a group of global names. Group objects are provided by BOB's name service to implement things like distribution lists, access control lists, and sets of servers. /com/dec/src/bin is a file system volume. Note that this object is a junction to the BOB file service. The figure does not show the content of this junction object, but it contains a group naming the set of servers implementing this file volume and rules for choosing which one to use, e.g. the first that responds.

To look up the name /com/dec/src/bin/ls, for example, the operating system on a client machine traverses the path /com/dec/src/bin using the name service. The result at that point is the content of the junction object, which then allows the client to contact a suitable file server to complete the lookup.

The local management autonomy provided by hierarchic names allows system implementors and administrators to build and use their systems without waiting for a planet-wide agreement to be reached about the structure of the first few levels of the global hierarchy. A local hierarchic name space can be constructed that is sufficient to the local need, treating the local root as a global root. Later, this local name space can be incorporated as a subtree in a larger name space. Using a variable to provide the name of the operational root (set to NIL at first) will ease the transition to the larger name space (and shorten the names that people actually use). Another technique is to initially name the local root with an identifier that is unlikely to appear in any future global root; then symbolic links in the global root, or special-case code in the local name resolution machinery, can ease the transition.
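The junction-crossing lookup just described can be made concrete with a small Python sketch. This is an illustration only: the in-memory tree stands in for the name service database, and the record fields follow the junction object described above.

    from dataclasses import dataclass

    @dataclass
    class Junction:
        servers: list        # global names of the servers for this object
        interface_id: str    # e.g. "BOB File Service 2.3"
        object_param: str    # e.g. a volume identifier

    def resolve(tree, path):
        # Walk the name tree; on reaching a junction, return it together
        # with the unresolved remainder for a file server to finish.
        node = tree
        parts = [p for p in path.split("/") if p]
        for i, part in enumerate(parts):
            node = node[part]
            if isinstance(node, Junction):
                return node, "/".join(parts[i + 1:])
        return node, ""

    # A tiny in-memory stand-in for the name service database.
    tree = {"com": {"dec": {"src": {"bin": Junction(
        servers=["/com/dec/src/computers/C1"],
        interface_id="BOB File Service 2.3",
        object_param="volume-bin")}}}}

    junction, rest = resolve(tree, "/com/dec/src/bin/ls")
    print(junction.interface_id, rest)   # -> BOB File Service 2.3 ls
    # A real clerk would now choose a server from junction.servers, map its
    # machine name to a network address with two more lookups, and make an
    # RPC carrying the remaining path plus junction.object_param.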
4.2 Access Model

Global access means that a program can run anywhere in BOB (on a computer and operating system compatible with the program binary) and get the same result, although the performance may vary depending on the machine chosen. Thus, a program can be executed either on the user's workstation or on a fast cycle server in the machine room, while still accessing the same user files through the same names. Achieving global access requires allowing all elements of the computing environment of a program to be remote from the computer where the program is executing. All services and objects required for a program to run need to be available to a program executing anywhere in the distributed system. For a particular user, "anywhere" includes at least:

o on the user's own workstation;
o on public workstations or compute servers in the user's domain;
o on public workstations in another domain on the user's LAN;
o on public workstations across a low-bandwidth WAN.

Performance in the first three cases should be similar. Performance in the fourth case is fundamentally limited by the WAN characteristics, although the use of caching can make the difference small in many cases.

In BOB, global naming and standard services exported via a uniform RPC mechanism provide the keys to achieving global access. All BOB services accept global names for the objects on which they operate. All BOB services are available to remote clients. Thus, any object whose global name is known can be accessed remotely. In addition, programs must access their environments only by using the global names of objects. This last step will require a thorough examination of the computing environment provided by each existing operating system to identify all the ways in which programs can access their environment. For each way identified, a mechanism must be designed to provide the global name of the desired object. For example, in Unix systems that operate as part of BOB, the identities of the file system root directory, working directory, and the /tmp directory of a process must be specified by global names. Altering VMS, Unix, and other operating systems to accept global names everywhere will be a major undertaking.

Another aspect of global access is making sure that BOB services have operation semantics that are location transparent. Without location transparency in the services it uses, a program will not get the same result when it runs on different computers. A service that allows read/write sharing of object state must provide a data coherence model. The model allows client programs to maintain correct object state and to behave in a manner that does not surprise users, no matter where clients and servers execute. Depending on the nature of the service, it is possible to trade off performance, availability, scale, and coherence.

In the name service, for example, it is appropriate to increase performance, availability, and scalability at the expense of coherence. A client update to the name service database can be made by contacting any server. After the client operation has completed, the server propagates the update to object replicas at other servers. Until propagation completes, different clients can read different values for the object. This lack of coherence produces several advantages: it increases performance by limiting the client's wait for update completion; it increases availability by allowing a client to perform an update when just one server is accessible; and it increases scale by separating propagation latency from the visible latency of the client update operation. For the objects that the name service will store, this lack of coherence is deemed acceptable given the benefits it produces. The data coherence model for the name service defines the loose coherence invariants that programmers can depend upon, thereby meeting the requirement of a coherence model that is insensitive to client and server location.
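A minimal Python sketch of this update-anywhere, propagate-later scheme follows. It is illustrative only: the timestamped last-writer-wins rule is an assumption chosen to keep the replicas convergent, not a mechanism specified by the chapter.

    import threading, time

    class NameServer:
        # One replica of the name service database.
        def __init__(self):
            self.store = {}   # name -> (value, timestamp)
            self.peers = []

        def update(self, name, value):
            # Apply the update locally and acknowledge immediately; the
            # client's visible latency excludes propagation to peers.
            stamped = (value, time.time())
            self.store[name] = stamped
            threading.Thread(target=self._propagate,
                             args=(name, stamped)).start()
            return "ok"

        def _propagate(self, name, stamped):
            for peer in self.peers:
                peer.receive(name, stamped)

        def receive(self, name, stamped):
            # Last-writer-wins keeps replicas convergent even if updates
            # arrive out of order; until propagation completes, reads at
            # different replicas may return different values.
            if name not in self.store or self.store[name][1] < stamped[1]:
                self.store[name] = stamped

    a, b = NameServer(), NameServer()
    a.peers, b.peers = [b], [a]
    a.update("/com/dec/src/mds", "Mailbox=...")   # contact any one server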
On the other hand, the BOB file service needs to provide consistent write sharing, even at some cost in performance, scale, and availability. Many programs and users are accustomed to using the file system as a communication channel between programs. For example, a programmer may save a source file for a module from an editor and then trigger a recompilation and relinking on a remote cycle server. He will be annoyed if the program is rebuilt with an old version of the module because the cycle server retrieved an old, cached version of the file. File read/write coherence is also important among elements of distributed computations running, say, on multiple computers on the same LAN. The file system coherence model must cover file names and attributes as well as file data.

There is diversity of opinion among researchers about the best consistency model for a general-purpose distributed file system. Some feel that an open/close consistency model provides the best tradeoff: with this model, changes made by one client are not propagated until that client closes the file and others (re)open it. Others feel that byte-level write sharing is feasible and desirable: with this model, clients share the file as though all were accessing it in the same local memory. Successful systems have been built using variants of both models. BOB can be successful with either model.

4.3 Security Model

Security is based on three notions:

o Authentication -- for every request to do an operation, the name of the user or computer system making the request is known reliably. The source of a request is called a "principal".

o Access control -- for every resource (computer, printer, file, database, etc.) and every operation on that resource (read, write, delete, etc.), it is possible to specify the names of the principals allowed to do that operation on that resource. Every request for an operation is checked to ensure that its principal is allowed to do that operation.

o Auditing -- every access to a resource can be logged if desired, as can the evidence used to authenticate every request. If trouble comes up, there is a record of exactly what happened.

To authenticate a request as coming from a particular principal, the system must determine that the principal originated the request and that the request was not modified on the way to its destination. We do the latter by establishing a "secure channel" between the system that originates the request and the one that carries it out. Practical security in a distributed system requires encryption to secure the communication channels. The encryption must not slow down communication, since in general it is too hard to be sure that a particular message does not need to be encrypted. So the security architecture should include methods of doing encryption and decryption on the fly, as data flows from computers into the network and back.

To determine who originated a request, it is necessary to know who is on the other end of the secure channel. Usually this is done by having the principal at the other end demonstrate that it knows some secret (such as a password), and then finding out in a reliable way the name of the principal that knows that secret. BOB's security architecture needs to specify how to do both these things. It is best if a principal can show that it knows the secret without giving it away, since otherwise the system could later impersonate the principal. Password-based schemes reveal the secret, but schemes based on encryption do not. It is desirable to authenticate a user by his possession of a device which knows his secret and can demonstrate this by encryption. Such a device is called a "smart card"; a small sketch of such an exchange appears below. An inferior alternative is for the user to type his password to a trusted agent. To authenticate a computer system, we need to be sure that it has been properly loaded with a good operating system image; BOB must specify methods to ensure this.
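The point that a principal should prove knowledge of a secret without revealing it can be illustrated with a challenge/response exchange. The Python sketch below uses an HMAC from the standard library purely as a modern stand-in for the "demonstrate by encryption" idea; the shared-secret arrangement between card and verifier is an assumption, not the chapter's design:

    import hmac, hashlib, os

    def prove(secret: bytes, challenge: bytes) -> bytes:
        # Respond to a fresh challenge without ever transmitting the secret.
        return hmac.new(secret, challenge, hashlib.sha256).digest()

    secret = b"known only to the smart card and the authentication service"
    challenge = os.urandom(16)            # fresh nonce defeats replay
    response = prove(secret, challenge)   # computed inside the card
    # The verifier recomputes the expected response from its own copy of
    # the secret; the secret itself never crosses the network.
    assert hmac.compare_digest(response, prove(secret, challenge))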
Security depends on naming, since access control identifies the principals that are allowed access by name. Practical security also depends on being able to have groups of principals, e.g. the Executive Committee, or the system administrators for the cluster named "Star". Both these facilities must be provided by the name service. To ensure that the names and groups are defined reliably, digital signatures are used to certify information in the name service; the signatures are generated by a special "certification authority" which is engineered for high reliability and kept off-line, perhaps in a safe, when its services are not needed. Authentication depends only on the smallest sub-tree of the full naming tree that includes both the requesting principal and the resource; certification authorities that are more remote are assumed to be less trusted.

Security also depends on time. Authentication, access control, and secure channels require correct timestamps and clocks. The time service must distribute time securely.

4.4 Management Model

System management is the adjustment of system state by a human manager. Management is needed when satisfactory algorithmic adjustments cannot be provided -- when human judgement is required. The problem in a large-scale distributed system is to provide each system manager with the means to monitor and adjust a fairly large collection of different types of geographically distributed components. Of the five pervasive properties of BOB, global management is the one we understand least well how to achieve. The facilities, however, are likely to be structured along the following lines.

The BOB management model is based on the concept of domains. Every component in a distributed system is assigned to a domain. (A component is a piece of equipment or a piece of management-relevant object state.) Each domain has a responsible system manager. In the simple version of the model (which is probably adequate for most, perhaps all, systems) domains are disjoint and managers are disjoint, although more complex arrangements are possible, e.g. overlapping domains or a hierarchy of managers. Ideally a domain would not depend on any other domains for its correct operation. Quite a bit of flexibility is needed for defining domains, as different arrangements will be effective in different installations. Example domains include:

o components used by a group of people with common goals;
o components that a group of users expects to find working;
o the largest pile of components under one system manager;
o an arbitrary pile of components that is not too big.

As a practical matter, customers will require guidelines for defining effective management domains.

BOB requires that each component define and export a management interface, using RPC if possible. Each component is managed via calls to this interface from interactive tools run by human managers. Some requirements for the management interface of a component are:

o Remote access -- The management interface provides remote access to all management functions. Local error logs are maintained that can be read from the management interface. A secure channel is provided from management tools to the interface, and the operations are subject to authentication and access control. No running around by the manager is required.

o Program interface -- Each component's management interface is designed to be driven by a program, not a person. Actual invocation of management functions is by RPC calls from management tools. This allows a manager to do a lot with a little typing. A good management interface provides end-to-end checks to verify successful completion of a series of complex actions, and provides operations that are independent of initial component state, to make it easier to achieve the desired final state.

o Relevance -- The management interface operates only on management-relevant state. In places where the flexibility is useful rather than just confusing, the management interface permits decentralized management by individual users.

o Uniformity -- Different kinds of components should strive for uniformity in their management interfaces. This allows a single manager to retain intellectual control of a larger number of kinds of components.

The management interfaces and tools make it practical for one person to manage large domains. An interactive management tool can invoke the management interfaces of all components in a domain. It provides suitable ways to display and correlate the data, and to change the management-relevant state of components. Management tools are capable of making the same state change in a large set of similar components in a domain via iterative calls, as the sketch below illustrates. To provide the flexibility to invent new management operations, some management tools support the construction of programs that call the management interfaces of domain components.
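A minimal Python sketch of such an iterative management action follows. Everything here is hypothetical: the component names, the set/get operations, and the in-memory stand-in for the management RPC interface are assumptions used to show the shape of the tool, including an end-to-end read-back check:

    def set_everywhere(components, parameter, value, call):
        # Apply one state change across a whole domain via iterative calls,
        # then read the state back as an end-to-end check of completion.
        failures = []
        for component in components:
            try:
                call(component, "set", parameter, value)
                if call(component, "get", parameter) != value:
                    failures.append(component)
            except TimeoutError:
                failures.append(component)   # unreachable or not updated
        return failures

    # Hypothetical in-memory stand-in for the management RPC interface.
    state = {"ws1": {}, "ws2": {}, "ws3": {}}
    def call(component, op, parameter, value=None):
        if op == "set":
            state[component][parameter] = value
        return state[component].get(parameter)

    print(set_everywhere(state, "bootfile", "v2.bin", call))   # -> []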
4.5 Availability Model

To achieve high availability of a service there must be multiple servers for that service. If these servers are structured to fail independently, then any desired degree of availability can be achieved by adjusting the degree of replication. The most practical scheme for replicating the services in BOB is primary/backup, in which a client uses one server at a time and servers are arranged (as far as possible) to fail in a benign way, say by stopping. The alternative method, called active replication, has the client perform each operation at several servers. Active replication uses more resources than primary/backup, but it has no failover delays and can tolerate arbitrary failure behavior by servers.

To see how primary/backup works, recall from the naming model discussion that the object representing a service includes a set of servers and some rules for choosing one. If the chosen server (the primary) fails, then the client can fail over to another server (the backup) and repeat the operation. The client assumes failure if a response to an operation does not occur within a timeout period. The timeout should be as short as possible, so that the latency of an operation that fails over is comparable to the usual latency. In practice, setting good timeouts is hard and fixed timeouts may not be adequate. If clients can do operations that change long-term state, then the primary server must keep the backup servers up to date.

To achieve transparent failover from the point of view of client programs, knowledge of the multiple servers should be encapsulated in an agent on the client computer. In this chapter we refer to the agent as a clerk. The clerk can export a logically centralized, logically local service to the client program, even when the underlying service implementation is distributed, replicated, and remote. The clerk software can have many different structural relationships to its client. In simple cases it can be runtime libraries loaded into the client address space and invoked with local procedure calls.
Or it may operate in a separate address space on the same machine as the client and be invoked by same-machine IPC, same-machine RPC, or callbacks from the operating system. Or it can be in the operating system itself. The clerk interface need not be the same as the server interface. Indeed, the server interface usually will be significantly more complex. In addition to implementing server selection and failover, a clerk may provide caching and write-behind to improve performance, and can implement aggregate operations that read-modify-write server state. As simple examples of caching, a name service clerk might remember the results of recently looked up names and maintain open connections to frequently used name servers, or a file service clerk might cache the results of recent file directory and file data reads. Write-behind allows a clerk to batch several updates as a single server operation which can be done asynchronously, thus reducing the latency of operations at the clerk interface. Implementing a client's read-modify-write operation might require the clerk to use complex retry strategies involving several server operations when a failover occurs at some intermediate step.

As an example of how a clerk masks the existence of multiple servers, consider the actions involved in listing the contents of /com/dec/src/udir/bwl/Mail/inbox, a BOB file system directory. The client program presents the entire path name to the file service clerk. The clerk locates a name server that stores the root directory and presents the complete name. That server may store the directories down to, say, /com/dec. The directory entry for /com/dec/src will indicate another set of servers. So the first lookup operation will return the new server set and the remaining unresolved path name. The clerk will then contact a server in the new set and present the unresolved path src/udir/bwl/Mail/inbox. This server discovers that src/udir is a junction to a file system volume, so it returns the junction information and the unresolved path name udir/bwl/Mail/inbox. Finally, the clerk uses the junction information to contact a file server, which in this example actually stores the target directory and responds with the directory contents. What looks like a single operation to the client program actually involves RPCs by the clerk to three different servers. This name resolution would complete with fewer or no operations at remote servers if the clerk had a cache that already contained the necessary information. In practice, caches are part of most clerk implementations and most operations are substantially sped up by the presence of cached data.

Other issues arise when implementing a high-availability service with long-term state. The servers must cooperate to maintain consistent state among themselves, so that a backup can take over reasonably quickly. Problems to be solved include arranging that no writes get lost during failover from one server to another, and that a server that has been down can recover the current state when it is restarted. Combining these requirements with caching and write-behind to obtain good performance, without sacrificing consistent sharing, can make implementing a highly available service quite challenging.
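The skeleton of such a clerk, reduced to read operations, might look like the Python sketch below. It is a sketch under stated assumptions, not a prescribed implementation: servers are assumed to expose read(name, timeout) and to fail benignly by stopping, and the write path, cache invalidation, and the retry strategies discussed above are deliberately omitted:

    class Clerk:
        # Client-side agent that hides replication behind a local interface:
        # try the primary, fail over to backups on timeout, cache reads.
        def __init__(self, servers, timeout=0.5):
            self.servers = list(servers)   # ordered by the service's rules
            self.timeout = timeout
            self.cache = {}

        def read(self, name):
            if name in self.cache:         # often avoids any remote call
                return self.cache[name]
            last = None
            for server in self.servers:    # primary first, then backups
                try:
                    value = server.read(name, timeout=self.timeout)
                    self.cache[name] = value
                    return value
                except TimeoutError as exc:
                    last = exc             # assume benign, fail-stop servers
            raise last                     # all replicas exhausted

A client program calls clerk.read("/com/dec/src/bin/ls") and sees a logically centralized, logically local service; the iteration over replicas and the cache are invisible at the clerk interface, which is exactly the transparency the availability model calls for.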
5. Conclusion

This chapter covers the inherent strengths of centralized and networked computing systems. It outlines the structure and properties of BOB, a state-of-the-art distributed computing base for supporting general-purpose computing. This system combines the best features of centralized and networked systems with recent advances in security and availability to produce a powerful, cost-effective, easy-to-use computing environment.

Getting systems like BOB into widespread use will be hard. Given the state of the art in distributed systems technology, building a prototype for proof of concept is certainly feasible. But the only practical method for getting widespread use of systems with these properties is to figure out ways to approach the goal by making incremental improvements to existing networked systems. This requires producing a sequence of useful, palatable changes that lead all the way to the goal.

6. Acknowledgements

The material in this chapter was jointly developed by Michael Schroeder, Butler Lampson, and Andrew Birrell. The ideas explored here aggregate many years of experience of the designers, builders, and users of distributed systems. Colleagues at SRC and fellow authors of this book have provided useful suggestions on the presentation.