ZooKeeper is a service for coordinating the processes of distributed applications; it provides wait-free coordination for Internet-scale systems.
Design Decisions
ZooKeeper moves away from implementing specific primitives on the server side and instead exposes an API that enables application developers to implement their own primitives, so new primitives can be introduced without changes to the service core.
ZooKeeper also moves away from blocking primitives such as locks (e.g. Chubby), since blocking primitives can cause, among other problems, slow or faulty clients to negatively impact the performance of faster clients. Instead, ZooKeeper implements an API that manipulates simple wait-free data objects (znodes), organized hierarchically as in file systems.
ZooKeeper guarantees the following two properties for client operations:
- FIFO client order: all requests from a given client are executed in the order that they were sent by the client. Note that this ordering guarantee is per client; for example, the order of read operations is not guaranteed across clients, since reads are handled locally by the server a client is connected to.
- Linearizable writes: all requests that update the state of ZooKeeper are serializable and respect precedence. This is guaranteed by implementing a leader-based atomic broadcast protocol called Zab.
ZooKeeper improves read performance by caching data on the client side; for example, a process can cache the identifier of the current leader instead of probing ZooKeeper every time it needs to know the leader. ZooKeeper uses a watch mechanism for this: a client can watch for an update to a given data object and receive a notification when the update occurs.
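For illustration, here is a minimal sketch using the official Java client; the /leader path and the LeaderCache class are hypothetical. It caches the current leader identifier locally and relies on a watch to refresh the cache only when the znode actually changes.

```java
import org.apache.zookeeper.*;
import org.apache.zookeeper.data.Stat;

public class LeaderCache implements Watcher {
    private final ZooKeeper zk;
    private volatile String cachedLeader;   // locally cached leader identifier

    public LeaderCache(ZooKeeper zk) throws KeeperException, InterruptedException {
        this.zk = zk;
        refresh();
    }

    // Read the (hypothetical) /leader znode and re-arm the watch.
    private void refresh() throws KeeperException, InterruptedException {
        byte[] data = zk.getData("/leader", this, new Stat());
        cachedLeader = new String(data);
    }

    @Override
    public void process(WatchedEvent event) {
        // The watch fired: the znode changed, so refresh the cache and re-register the watch.
        if (event.getType() == Watcher.Event.EventType.NodeDataChanged) {
            try { refresh(); } catch (Exception e) { /* handle retry/reconnect here */ }
        }
    }

    // Served from the local cache; no server round trip per lookup.
    public String leader() { return cachedLeader; }
}
```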
Service Overview
ZooKeeper provides its clients with the abstraction of a set of data nodes (znodes), organized according to a hierarchical name space and referred to using standard UNIX file-system path notation. All znodes store data, and all znodes, except for ephemeral znodes, can have children. There are two types of znodes that a client can create (a creation sketch follows the list):
- Regular: Clients manipulate regular znodes by creating and deleting them explicitly.
- Ephemeral: Clients create such znodes, and they either delete them explicitly or let the system remove them automatically when the session that created them terminates.
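A minimal creation sketch with the Java client (connection string, paths, and timeout are hypothetical, and parent znodes are assumed to exist):

```java
import org.apache.zookeeper.*;

public class CreateExample {
    public static void main(String[] args) throws Exception {
        // Connect and start a session with a 5 second timeout (values are hypothetical).
        ZooKeeper zk = new ZooKeeper("localhost:2181", 5000, event -> {});

        // Regular znode: lives until a client deletes it explicitly.
        zk.create("/app1/config", "cfg-v1".getBytes(),
                  ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

        // Ephemeral znode: removed automatically when the creating session ends.
        zk.create("/app1/workers/worker-1", new byte[0],
                  ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);

        zk.close();
    }
}
```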
ZooKeeper implements watches to allow clients to receive timely notifications of changes without polling. Watches are one-time triggers associated with a session; they are unregistered once triggered or when the session closes. Watches indicate that a change has happened but do not provide the changed data. Session events, such as connection-loss events, are also sent to watch callbacks so that clients know that watch events may be delayed.
Znodes map to abstractions of the client application, typically corresponding to meta-data used for coordination purposes. Although znodes are not designed for general data storage, ZooKeeper does allow clients to store some information that can be used as meta-data or configuration in a distributed computation. Each znode also has associated meta-data, including timestamps and version numbers, which allow clients to track its status and perform conditional updates.
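A small sketch (hypothetical path, assuming an existing ZooKeeper handle) of reading a znode's meta-data and using its version number for a conditional update:

```java
import org.apache.zookeeper.*;
import org.apache.zookeeper.data.Stat;

public class ZnodeMeta {
    // Assumes /app1/config already exists (hypothetical path).
    static void showAndUpdate(ZooKeeper zk) throws KeeperException, InterruptedException {
        Stat stat = new Stat();
        byte[] data = zk.getData("/app1/config", false, stat);

        // Meta-data ZooKeeper keeps for every znode.
        System.out.println("data bytes: " + data.length);
        System.out.println("version: " + stat.getVersion());
        System.out.println("last modified (mtime): " + stat.getMtime());

        // Conditional update: succeeds only if the znode is still at the version we read.
        zk.setData("/app1/config", "cfg-v2".getBytes(), stat.getVersion());
    }
}
```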
A client connects to ZooKeeper and initiates a session. Sessions have an associated timeout, and ZooKeeper considers a client faulty if it does not receive anything from its session for more than that timeout.
APIs
The commonly used APIs provided by ZooKeeper are:
- create(path, data, flags): creates a znode with path name path, storing data in it. flags enables a client to select the type of znode (regular or ephemeral) and to set the sequential flag, which appends a monotonically increasing counter to the znode's name.
- delete(path, version): deletes the znode path if it is at the expected version.
- exists(path, watch): returns true if the znode path exists, and enables the client to set a watch on it.
- getData(path, watch): returns the data and meta-data of the znode, with the same watch semantics as exists.
- setData(path, data, version): writes data to the znode path if it is at the expected version.
- getChildren(path, watch): returns the set of names of the children of the znode.
- sync(path): waits for all updates pending at the start of the operation to propagate to the server that the client is connected to.
All methods have both synchronous and asynchronous versions, and the ZooKeeper client guarantees that the corresponding callbacks for each operation are invoked in order.
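A short sketch of the asynchronous variant (hypothetical paths): two creates are submitted back to back, and their callbacks are delivered in submission order.

```java
import org.apache.zookeeper.*;
import org.apache.zookeeper.AsyncCallback.StringCallback;

public class AsyncCreate {
    // Issue two asynchronous creates; the client library invokes the callbacks
    // in the same order the operations were submitted.
    static void createTwo(ZooKeeper zk) {
        StringCallback cb = (rc, path, ctx, name) ->
            System.out.println(ctx + " -> rc=" + KeeperException.Code.get(rc) + ", name=" + name);

        zk.create("/app1/task-a", new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE,
                  CreateMode.PERSISTENT, cb, "first");
        zk.create("/app1/task-b", new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE,
                  CreateMode.PERSISTENT, cb, "second");
        // "first" is always reported before "second".
    }
}
```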
Refer to the paper for example primitives implemented with these APIs, including configuration management, group membership, read/write locks, etc.
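As one concrete example, the paper's simple lock can be sketched roughly as follows: whoever manages to create an ephemeral lock znode holds the lock; everyone else watches the znode and retries when it disappears. The path and class names are hypothetical, and this simple variant has the herd effect the paper discusses.

```java
import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.*;

public class SimpleLock {
    private static final String LOCK = "/app1/lock";   // hypothetical lock path

    // Block until we own the lock; holding the ephemeral znode is holding the lock.
    static void acquire(ZooKeeper zk) throws KeeperException, InterruptedException {
        while (true) {
            try {
                zk.create(LOCK, new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
                return;                                 // create succeeded: we hold the lock
            } catch (KeeperException.NodeExistsException e) {
                CountDownLatch gone = new CountDownLatch(1);
                // Watch the lock znode; the watch fires when it is deleted (or on session events).
                if (zk.exists(LOCK, event -> gone.countDown()) != null) {
                    gone.await();                       // wait, then retry the create
                }
            }
        }
    }

    static void release(ZooKeeper zk) throws KeeperException, InterruptedException {
        zk.delete(LOCK, -1);                            // -1 matches any version
    }
}
```

Because the lock znode is ephemeral, the lock is released automatically if its holder crashes or its session expires.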
Implementation
ZooKeeper provides high availability by replicating the data on each server. For recoverability, writes are forced to disk media before they are applied to the in-memory database, and each server keeps a replay log of committed operations and generates periodic snapshots of the in-memory database.
Read requests are serviced from the local replica of the server the client is connected to. The local replica never diverges from the agreed order of updates, although at any point in time some servers may have applied more transactions than others; as a result, a read operation might return a stale value. To guarantee that a read retrieves the latest value, a client calls sync followed by the read operation.
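A sketch of this pattern with the Java client (hypothetical path); note that sync in the Java API is asynchronous, so the sketch waits for its callback before issuing the read.

```java
import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.*;
import org.apache.zookeeper.data.Stat;

public class SyncRead {
    // Read /app1/config (hypothetical path), first flushing any updates pending
    // between the leader and the server this client is connected to.
    static byte[] readLatest(ZooKeeper zk) throws Exception {
        CountDownLatch synced = new CountDownLatch(1);
        zk.sync("/app1/config", (rc, path, ctx) -> synced.countDown(), null);
        synced.await();                                  // wait for sync to complete
        return zk.getData("/app1/config", false, new Stat());
    }
}
```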
Write requests are forwarded to a single server, called the leader. The leader executes the request and broadcasts the state change to the followers through Zab, an atomic broadcast protocol. Zab uses majority quorums to decide on a proposal and provides stronger ordering guarantees than regular atomic broadcast. A write-ahead log is used to keep track of all changes to the database. Zab normally delivers messages in order and exactly once, but it may redeliver a message during recovery; because transactions are idempotent, the end state is the same as long as they are delivered in order.
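A schematic illustration (not ZooKeeper's actual code) of why this works: the leader turns a conditional update into a transaction carrying the full resulting data and version, so re-applying it in order leaves the state unchanged.

```java
import java.util.HashMap;
import java.util.Map;

// Schematic sketch, not ZooKeeper's actual implementation.
class SetDataTxn {
    final String path;
    final byte[] data;      // the full resulting data, not a delta
    final int newVersion;   // the version the znode will have after applying
    SetDataTxn(String path, byte[] data, int newVersion) {
        this.path = path; this.data = data; this.newVersion = newVersion;
    }
}

class Replica {
    final Map<String, byte[]> store = new HashMap<>();
    final Map<String, Integer> versions = new HashMap<>();

    // Applying the same transaction twice (in order) yields the same state:
    // the stored data and version are simply overwritten with the same values.
    void apply(SetDataTxn txn) {
        store.put(txn.path, txn.data);
        versions.put(txn.path, txn.newVersion);
    }
}
```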
When taking a snapshot of a ZooKeeper server, the server state is not locked; instead, the server does a depth-first scan of the tree, atomically reading each znode and writing it to disk. The result might not correspond to the state of ZooKeeper at any single point in time, since some state changes are applied while the snapshot is being generated. However, since state changes are idempotent, they can be applied multiple times as long as they are applied in order.
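A schematic sketch (again not the actual server code) of such a fuzzy snapshot: a depth-first walk that reads each znode atomically and streams it out without locking the whole tree.

```java
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.List;

// Schematic sketch, not ZooKeeper's actual implementation.
class ZnodeTree {
    static class Node {
        final String path;
        byte[] data;
        List<Node> children;
        Node(String path, byte[] data, List<Node> children) {
            this.path = path; this.data = data; this.children = children;
        }
    }

    // Depth-first scan; each znode is read atomically, but the tree keeps
    // changing underneath us, so the snapshot is "fuzzy".
    static void snapshot(Node node, DataOutputStream out) throws IOException {
        byte[] data;
        synchronized (node) {               // atomic read of a single znode
            data = node.data.clone();
        }
        out.writeUTF(node.path);
        out.writeInt(data.length);
        out.write(data);
        for (Node child : node.children) {
            snapshot(child, out);
        }
    }
}
```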
Servers process writes in order and do not process other writes or reads concurrently. This ensures strict succession of notifications. Note that servers handle notifications locally; only the server that a client is connected to tracks and triggers notifications for that client.
Each read request is processed and tagged with a zxid that corresponds to the last transaction seen by the server. If the client needs to connect to a new server, that server ensures that its view of the ZooKeeper data is at least as recent as the client's view by checking the client's last zxid against its own last zxid, and the connection is not established until the server has caught up. The client is guaranteed to be able to find another server that has a recent enough view of the system, since the client only sees changes that have been replicated to a majority of the ZooKeeper servers.
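Schematically (not the actual server code), the reconnection check amounts to:

```java
// Schematic sketch, not ZooKeeper's actual implementation.
class ConnectionCheck {
    // The client sends the last zxid it has seen; the server compares it with
    // the last zxid it has applied and delays the session until it catches up.
    static boolean canServe(long clientLastZxid, long serverLastZxid) {
        return serverLastZxid >= clientLastZxid;
    }
}
```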
ZooKeeper uses timeouts to detect client session failures. If the client cannot communicate with a server to send a request, it connects to a different server to re-establish the session.