Ongaro’s thesis describes
PreVote phase in Raft that is designed to prevent a partitioned
server from disrupting the cluster by forcing a re-election when it rejoins the
cluster. This article is about the benefits of using a similar mechanism in a
The trick with showing the liveness of a Paxos cluster boils down to making
sure that nodes send the right quantitiy of
prepare (a.k.a. phase
messages: too few can lead to deadlock, whereas too many can lead to livelock.
The other messages (
accepted) affect the safety of
the system, but
prepare messages have no bearing on safety. In contrast, as
long as they are sent in a timely fashion the other messages have no real
bearing on liveness.
Presentations of Paxos often show its liveness property by assuming the existence of a distinguished leader which is elected by some unspecified external process. Raft has no need for a separate leader-election process as it has a built-in system of timeouts instead, which is much cuter. Ongaro’s thesis does not show that Raft satisfies any particular liveness properties, but it certainly seems plausible that it works. [I am unaware of further work on Raft’s liveness properties but please get in touch if you know of some.]
It’s possible to run Paxos without an external leader-election process in much
the same way that Raft works, and I found that a similar
PreVote phase was
very helpful in ensuring its liveness. The mechanism is as follows. Each term
number has a well-defined owner, and nodes consider the owner of the term of
the most-recently-chosen value to be the leader, as long as this value was
chosen sufficiently recently. In more detail, each node is in one of four
Leader - this node owns the term of the most-recently-chosen value that it knows about, and this value was chosen very recently.
Incumbent - this node owns the term of the most-recently-chosen value that it knows about, and this value was chosen recently, and not very recently.
Follower - some other node owns the term of the most-recently-chosen value that it knows about, and this value was chosen recently or very recently.
Candidate - no value is known to have been chosen recently.
The definitions of recently and very recently are in terms of timeouts set by the system’s operator. More formally, nodes implement the following state machine.
State transitions due to timeouts are shown in red, and those due to values being chosen are shown in green (if in a term owned by the current node) or blue (if in a term owned by another node). If a leader times out and becomes an incumbent it also attempts to propose a new no-op value in a term that it owns which if successfully chosen will reinstate it as a leader again. The timeouts should be set such that leaders become incumbents in good time to become leaders again before any followers become candidates. Note that each timeout is implemented based on the node’s local clock and there is no need to share time values between nodes, so there is no need for any kind of clock synchronisation in this setup.
It is hopefully reasonably obvious from this that eventually either there exists a node that has learned the next value or else all nodes have become candidates.
When a candidate times out it remains a candidate but also attempts to make
progress (i.e. to learn that a later value is chosen) as follows. Firstly, it
seek-votes message indicating the latest value it learned. The
recipients of this message may do one of three things:
- If the recipient knows that a later value has been chosen then it responds
offer-catch-upmessage indicating as such.
- If not, but the recipient believes there is an active leader (i.e. they are not themselves a candidate) then the message is ignored.
- If neither of the previous cases apply then the recipient responds with an
offer-votemessage indicating the greatest term number for which it has previously sent a
If a candidate receives an
offer-catch-up message it communicates directly
with the sender in order to learn the missing values. If instead it receives
offer-vote messages from a quorum of its peers then it sends out a
message using a term that is no less than any of the terms received in the
offer-vote messages, and runs the usual protocol from there on.
The protocol as described is eventually quiescent: absent any external stimulation (timeouts or client activity) eventually the system reaches a state where there are no messages either in flight or being processed. This is, approximately, because there are no loops in the graph of the messages that can be transmitted on receipt of each type of message:
This helps to completely rule out certain kinds of livelock. The situation is a
little more complicated than this because a cluster reconfiguration requires a
new phase 1 to take place, so an
accepted message can result in a
message too, but this doesn’t result in livelock as there can only be finitely
many pending reconfigurations.
If the round-trip time between nodes is bounded then this observation puts a bound on the time it takes for the system to become quiescent after external stimulation. In particular if there are no pending proposals at any node then it takes at most three round-trips to become quiescent.
As noted above, eventually either there exists a node that has learned the next value or else all nodes have become candidates. It remains to show that if all nodes are candidates, at least one of which can communicate with a quorum of its peers, then the system still makes progress.
If candidates wake up too frequently then the system will never become quiescent, but if they wake up too infrequently then the system will take a long time to recover from an unexpected failure of the leader. Absent a tight bound on the network’s round-trip time, it is preferable that candidates wake up with random and growing timeouts because as long as the timeouts grow to be sufficiently long compared with the communication round-trip-time between nodes, eventually-almost-certainly one candidate will time out at a point in time when the system is quiescent and no other node will time out for at least three round-trips.
Sketch Theorem Such a candidate, if connected to a quorum of peers, makes progress.
Sketch Proof. Let n such a candidate. As noted above, its peers are
eventually either all candidates or else contain a node which has learned some
later values. Also, eventually-almost-certainly, it is the only active node. It
seek-votes message and receives responses from its peers. It
is connected to a quorum of peers, so it either receives an
message or a quorum of
In the former case it learns some later chosen values from the sender. In the
latter case it sends out a
prepare message to start phase 1 of the usual
Paxos algorithm. Since n is the only active node, no other
has been sent since any of its peers sent out their
which means that they will all respond with
On receipt of this quorum of
promised messages, n may propose a value. If
none of its peers have previously accepted a value in an earlier term then it
may freely choose to propose any value. Typically it would choose a
action in this situation.
On the other hand if some of its peers have previously accepted a value then it
must propose one of their values according to the rules of the Paxos algorithm.
Again since n is the only active node then no other
promised messages will
have been sent before its peers receive the
proposed message, so the proposal
will result in a quorum of
accepted messages which results in the proposed
value being chosen and learned as required. QED.
Unboundedly large messages
The time it takes to send and receive a message is a function of its size, so
it is only safe to assume that the round-trip time is bounded if messages have
bounded size. In practice if a message is too large then nodes time-out before
it is fully delivered, which can lead to a leadership election, and also breaks
the liveness proof above: if
promised messages may take arbitrarily long to
deliver then there may never be a time when there is a unique active node and
therefore no election may ever succeed. It’s a good idea to limit the size of
values and to set any timeout limits accordingly.
Alternatively, one could split up
proposed messages into
chunks of bounded size and reset the timeout timer each time a chunk is
There is no need for nodes to send out
prepare messages for terms that they
own, and this can be used to implement an abdication mechanism allowing the
current leader to step down having determined its successor. The alternative to
abdication is for the leader to simply stop sending out any messages and to
wait for the other nodes to time out. The disadvantages of this include
the system is effectively unavailable until the timeouts expire, whereas with abdication the new leader normally takes over within a few round-trips without any risk of contention
the new leader is chosen by a nondeterministic race, whereas with abdication the new leader can be selected explicitly. For example, if there are a number of clusters running in a shared environment then it may make sense to try and distribute the leaders as evenly as possible across the environment.
A leader abdicates simply by sending a
prepare message for a term owned by
the new leader that is larger than the current term. Nodes should respond with
promised message sent to the owner of the term and not to the original
sender, and from there the usual Paxos message flow leads to the new leader
owning the term of the next value to be chosen and therefore being recognised
as the cluster leader.
As mentioned above, there could be a completely separate leader-election
process which determines which node is permitted to send
This seems inelegant compared with the idea of building leader-election into
the consensus system, but could be appropriate if there are particular
requirements on how nodes may become leader that are not well-served by the
Another strategy that implementations sometimes use is to consider the leader
to have failed if a value has not been chosen for a period, and to immediately
send out a
prepare message after this timeout expires. Raft (without its
PreVote phase) uses this kind of strategy. The main challenge with this
approach seems to be the selection of an appropriate term to put in the
prepare message: it’s common to choose a term that is slightly higher than
the greatest term seen so far in any known message. In practice this seems to
work ok although it does seem to suffer from (temporary) livelocks more. In
theory it’s a bit problematic: my attempted proofs that this works always
seemed to get stuck on the fact that there’s no guarantee that the selected
term is large enough, so any particular node may fail to be elected when it
times out. It’s not at all obvious that always there is eventually a node that
selects a high-enough term.
It was these failed proofs that led me to the idea of the
offer-vote responses to determine an appropriate term for the subsequent
prepare message. With that idea, the proof fell into place.
Disruption on reconnection
As in Raft, this extra phase prevents nodes from triggering a new leader
election when they reconnect: out-of-date nodes receive a
message before they have received a quorum of
offer-vote messages. This means
that the leader tends to be stable for as long as it remains connected to a
quorum of its peers, and other nodes may only send a
prepare message once
there is a quorum of nodes that believe the leader to have failed.
After a reconnection the system can get itself into a subtle and rare state in
which it’s still making progress, but some minority of nodes have entered a
later term than the one in which the majority is working. This means that this
minority can’t accept any proposals from the leader, but nor can they gather
offer-vote messages to trigger a leader election in a later term.
Instead, they continually receive
offer-catch-up messages and learn new
values through the catch-up process. This may not be too much of a problem, and
the situation will resolve itself at the next election, but elections should be
relatively uncommon so the situation could last for many days. It’s also
operationally irritating: if it weren’t for this state then you could consider
candidate nodes to be indicative of problems in the cluster for monitoring
The fix is to include the candidate’s current term in its
and if a leader receives a
seek-votes message containing a later term then it
performs another phase 1 at an even later term. This has the effect of keeping
a stable leader while allowing the other nodes to start accepting its proposals
again, which avoids upsetting the monitoring system.
There’s an enthusiastic review of this article over on simbo1905’s blog which gives some more intuitive coverage of the ideas in this article as well as some thoughts on how this mechanism relates to the one in TRex.