Recover from Excessive Faults in Partially-Synchronous BFT SMR

On April 27, 2025 (updated June 27, 2025) By lewis-pye

Byzantine fault-tolerant (BFT) state machine replication (SMR) protocols form the basis of modern blockchains as they maintain a consistent state across all blockchain nodes while tolerating a bounded number of Byzantine faults. We analyze BFT SMR in the excessive fault setting where the actual number of Byzantine faults surpasses a protocol’s tolerance. We start by devising the very first repair algorithm for linearly chained and quorum-based partially synchronous SMR to recover from faulty states caused by excessive faults. Such a procedure can be realized using any commission fault detection module, that is, an algorithm that identifies the faulty replicas without falsely locating any correct replica. We achieve this with a slightly weaker liveness guarantee, as the original security notion is impossible to satisfy given excessive faults. We implement recoverable HotStuff in Rust. After the recovery routines terminate, throughput returns to the normal level (the level without excessive faults) for 7 replicas and is reduced by at most 4.3% for 30 replicas. On average, latency increases by 12.87% for 7 replicas and 8.85% for 30 replicas. Aside from adopting existing detection modules, we also establish the sufficient condition for a general BFT SMR protocol to allow for complete and sound fault detection when up to (n−2) Byzantine replicas (out of n total replicas) attack safety. We start by providing the first closed-box fault detection algorithm for any SMR protocol without any extra rounds of communication. We then describe open-box instantiations of our fault detection routines in Tendermint and HotStuff, further reducing the overhead, both asymptotically and concretely.
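To make "excessive faults" concrete, here is a minimal sketch of the standard quorum arithmetic, assuming the usual tolerance of t = ⌊(n−1)/3⌋ faults and quorums of size n − t. The function names are ours, chosen for illustration, and this is not the paper's repair or detection algorithm. It only shows that any two quorums overlap in at least t + 1 replicas, so the overlap can be entirely Byzantine only once the actual number of faults exceeds t.

```python
def max_tolerated_faults(n: int) -> int:
    """Largest t such that n >= 3t + 1 (the usual BFT tolerance)."""
    return (n - 1) // 3


def quorum_size(n: int) -> int:
    """Quorum size n - t (an assumption; protocols may phrase this differently)."""
    return n - max_tolerated_faults(n)


def min_quorum_overlap(n: int) -> int:
    """Any two quorums of size q drawn from n replicas share at least 2q - n members."""
    return 2 * quorum_size(n) - n


if __name__ == "__main__":
    # The replica counts used in the evaluation described above.
    for n in (7, 30):
        t = max_tolerated_faults(n)
        print(f"n={n}: t={t}, quorum={quorum_size(n)}, overlap>={min_quorum_overlap(n)}")
```

For n = 7 this gives t = 2, quorums of 5 and an overlap of at least 3. Two conflicting quorums therefore require at least t + 1 replicas to have supported both sides, which is the intuition behind identifying culprits after a safety violation.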

Joint work with Tiantian Gong, Gustavo Franco Camilo, Kartik Nayak, and Aniket Kate.

Here is the pdf (USENIX Security 2025).

Morpheus Consensus: Excelling on trails and autobahns

On March 17, 2025 (updated October 23, 2025) By lewis-pye

Recent research in consensus has often focussed on protocols for State-Machine-Replication (SMR) that can handle high throughputs. Such state-of-the-art protocols (generally DAG-based) induce undue overhead when the needed throughput is low, or else exhibit unnecessarily-poor latency and communication complexity during periods of low throughput.

Here we present Morpheus Consensus, which naturally morphs from a quiescent low-throughput leaderless blockchain protocol to a high-throughput leader-based DAG protocol and back, excelling in latency and complexity in both settings. During high throughput, Morpheus is on par with state-of-the-art DAG-based protocols, including Autobahn. During low throughput, Morpheus exhibits competitive complexity and lower latency than standard protocols such as PBFT and Tendermint, which in turn do not perform well during high throughput.

The key idea of Morpheus is that as long as blocks do not conflict (due to Byzantine behaviour, network delays, or high-throughput simultaneous production) it produces a forkless blockchain, promptly finalizing each block upon arrival. It assigns a leader only if one is needed to resolve conflicts, in a manner and with performance not unlike Autobahn.

Morpheus, pdf. This is joint work with Ehud Shapiro (OPODIS 2025).

Snowman for partial synchrony

On February 6, 2025 (updated June 27, 2025) By lewis-pye

Snowman is the consensus protocol run by blockchains on Avalanche. Recent work of ours established a rigorous proof of probabilistic consistency for Snowman in the synchronous setting, under the simplifying assumption that correct processes execute sampling rounds in “lockstep”. In this paper, we describe a modification of the protocol that ensures consistency in the partially synchronous setting, and when correct processes carry out successive sampling rounds at their own speed, with the time between sampling rounds determined by local message delays.

Joint work with Aaron Buchwald, Stephen Buttolph, and Kevin Sekniqi.

Snowman for partial synchrony, pdf.

The Economic Limits of Permissionless Consensus

On May 18, 2024 (updated March 20, 2026) By lewis-pye

The purpose of a consensus protocol is to keep a distributed network of nodes “in sync,” even in the presence of an unpredictable communication network and adversarial behavior by some of the participating nodes. In the permissionless setting relevant to modern blockchain protocols, these nodes may be operated by a large number of unknown players, with each player free to use multiple identifiers and to start or stop running the protocol at any time. Establishing that a permissionless consensus protocol is “secure” thus requires both a distributed computing argument (that the protocol guarantees consistency and liveness unless the fraction of adversarial participation is sufficiently large) and an economic argument (that carrying out an attack would be prohibitively expensive for a potential attacker). There is a mature toolbox for assembling arguments of the former type; the goal of this paper is to lay the foundations for arguments of the latter type. For example, the Ethereum protocol is oft-claimed to be “more economically secure” after “the merge,” meaning in its current proof-of-stake incarnation relative to the (proof-of-work) original. What, formally, does this assertion mean? Is it true? Could there be alternative protocols that are “still more economically secure” than Ethereum? How do the answers depend on the assumptions imposed on, for example, the reliability of message delivery or the active participation of non-malicious players?

An ideal permissionless consensus protocol would, in addition to satisfying standard consistency and liveness guarantees, render consistency violations prohibitively expensive for the attacker without collateral damage to honest participants, for example by programmatically confiscating an attacker’s resources without reducing the value of honest participants’ resources, as is the intention for slashing in a proof-of-stake protocol. We make this idea precise with our notion of the EAAC (expensive to attack in the absence of collapse) property, and prove the following results:

(1) In the synchronous and dynamically available setting (in which the communication network is reliable but non-malicious players may be periodically inactive), with an adversary that controls at least one-half of the overall resources, no protocol can be EAAC. In particular, this result rules out EAAC for all typical longest-chain protocols (be they proof-of-work or proof-of-stake).

(2) In the partially synchronous and quasi-permissionless setting (in which resource-controlling non-malicious players are always active but the communication network may suffer periods of unreliability), with an adversary that controls at least one-third of the overall resources, no protocol can be EAAC. In particular, slashing in a proof-of-stake protocol cannot achieve its intended purpose if message delays cannot be bounded a priori.

(3) In the synchronous and quasi-permissionless setting, there is a proof-of-stake protocol with slashing that, provided the adversary controls less than two-thirds of the overall stake, satisfies the EAAC property.

All three results are optimal with respect to the size of the adversary. With respect to Ethereum, our work formalizes the potential security benefits of proof-of-stake sybil-resistance coupled with slashing and the common belief that the merge has increased Ethereum’s economic security. Our work also provides mathematical justifications for several key design decisions behind the post-merge Ethereum protocol, ranging from long cooldown periods for unstaking to economic penalties for inactivity.
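As a reading aid, results (1) to (3) can be summarised as a small lookup, sketched below. The setting labels and the function name are hypothetical, and the function deliberately returns None for any combination that the three results as stated above do not cover, rather than guessing.

```python
from fractions import Fraction
from typing import Optional


def eaac_possible(timing: str, participation: str,
                  adversary: Fraction) -> Optional[bool]:
    """True: an EAAC protocol exists; False: no protocol can be EAAC;
    None: not covered by results (1)-(3) as stated in this summary."""
    setting = (timing, participation)
    if setting == ("synchronous", "dynamically-available"):
        # Result (1): impossible once the adversary holds at least one-half.
        return False if adversary >= Fraction(1, 2) else None
    if setting == ("partially-synchronous", "quasi-permissionless"):
        # Result (2): impossible once the adversary holds at least one-third.
        return False if adversary >= Fraction(1, 3) else None
    if setting == ("synchronous", "quasi-permissionless"):
        # Result (3): achievable while the adversary holds less than two-thirds.
        return True if adversary < Fraction(2, 3) else None
    return None


if __name__ == "__main__":
    print(eaac_possible("synchronous", "quasi-permissionless", Fraction(3, 5)))
```

Exact fractions are used instead of floats so that the boundary cases (exactly one-half, one-third, two-thirds) are compared without rounding error.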

The Economic Limits of Permissionless Consensus: pdf. (Economics and Computation 2024.) Also presented at SBC 2024.

Winner of “Best Theoretical Research Paper” in the Best DeFi Papers Awards, presented at DeFi’25.

This is joint work with Eric Budish and Tim Roughgarden.

Frosty: Bringing strong liveness guarantees to the Snow family of consensus protocols.

On April 25, 2024 (updated January 18, 2025) By lewis-pye

Snowman is the consensus protocol implemented by the Avalanche blockchain and is part of the Snow family of protocols, first introduced through the original Avalanche leaderless consensus protocol. A major advantage of Snowman is that each consensus decision only requires an expected constant communication overhead per processor in the “common” case that the protocol is not under substantial Byzantine attack, i.e. it provides a solution to the scalability problem which ensures that the expected communication overhead per processor is independent of the total number of processors $n$ during normal operation. This is the key property that would enable a consensus protocol to scale to 10,000 or more independent validators (i.e. processors). On the other hand, the two following concerns have remained:

(1) Providing formal proofs of consistency for Snowman has presented a formidable challenge.

(2) Liveness attacks exist in the case that a Byzantine adversary controls more than $O(\sqrt{n})$ processors, slowing termination to more than a logarithmic number of steps.

In this paper, we address the two issues above. We consider a Byzantine adversary that controls at most $f<n/5$ processors. First, we provide a simple proof of consistency for Snowman. Then we supplement Snowman with a “liveness module” that can be triggered in the case that a substantial adversary launches a liveness attack, and which guarantees liveness in this event by temporarily forgoing the communication complexity advantages of Snowman, but without sacrificing these low communication complexity advantages during normal operation.

Frosty, pdf, FC 2025. This is joint work with Aaron Buchwald, Stephen Buttolph, Patrick O’Grady and Kevin Sekniqi of Ava Labs.

Lumiere: Making Optimal BFT for Partial Synchrony Practical

On January 18, 2024 (updated April 22, 2024) By lewis-pye

The view synchronization problem lies at the heart of many Byzantine Fault Tolerant (BFT) State Machine Replication (SMR) protocols in the partial synchrony model, since these protocols are usually based on views. Liveness is guaranteed if honest processors spend a sufficiently long time in the same view during periods of synchrony, and if the leader of the view is honest. Ensuring that these conditions occur, known as Byzantine View Synchronization (BVS), has turned out to be the performance bottleneck of many BFT SMR protocols.

A recent line of work has shown that, by using an appropriate view synchronization protocol, BFT SMR protocols can achieve $O(n^2)$ communication complexity in the worst case after GST, thereby finally matching the lower bound established by Dolev and Reischuk in 1985. However, these protocols suffer from two major issues:
(a) When implemented so as to be optimistically responsive, even a single Byzantine processor may infinitely often cause $\Omega(n\Delta)$ latency between consecutive consensus decisions.
(b) Even in the absence of Byzantine action, infinitely many views require honest processors to send $\Omega(n^2)$ messages.

Here, we present Lumiere, an optimistically responsive BVS protocol which maintains optimal worst-case communication complexity while simultaneously addressing the two issues above: for the first time, Lumiere enables BFT consensus solutions in the partial synchrony setting that have $O(n^2)$ worst-case communication complexity, and that eventually always (i.e., except for a small constant number of “warmup” decisions) have communication complexity and latency that are linear in the number of actual faults in the execution.

Lumiere pdf, PODC 2024.

This is joint work with Dahlia Malkhi, Oded Naor, and Kartik Nayak.

Permissionless Consensus

On April 28, 2023 (updated May 20, 2024) By lewis-pye

Blockchain protocols typically aspire to run in the permissionless setting, in which nodes are owned and operated by a large number of diverse and unknown entities, with each node free to start or stop running the protocol at any time. This setting is more challenging than the traditional permissioned setting, in which the set of nodes that will be running the protocol is fixed and known at the time of protocol deployment. The goal of this paper is to provide a framework for reasoning about the rich design space of blockchain protocols and their capabilities and limitations in the permissionless setting.
This paper offers a hierarchy of settings with different “degrees of permissionlessness”, specified by the amount of knowledge that a protocol has about the current participants: the fully permissionless, dynamically available, and quasi-permissionless settings.
The paper also proves several results illustrating the utility of our analysis framework for reasoning about blockchain protocols in these settings. For example:
(1) In the fully permissionless setting, even with synchronous communication and with severe restrictions on the total size of the Byzantine players, every deterministic protocol for Byzantine agreement has an infinite execution in which honest players never terminate.
(2) In the dynamically available and partially synchronous setting, no protocol can solve the Byzantine agreement problem with high probability, even if there are no Byzantine players at all.
(3) In the quasi-permissionless and partially synchronous setting, by contrast, assuming a bound on the total size of the Byzantine players, there is a deterministic proof-of-stake protocol for state machine replication.

(4) In the dynamically available, authenticated, and synchronous setting, no optimistically responsive state machine replication protocol guarantees consistency and liveness, even when there are no Byzantine players at all.
(5) In the quasi-permissionless and synchronous setting, every proof-of-stake protocol that uses only time-malleable oracles is vulnerable to long-range attacks.

Permissionless Consensus: pdf (the conference version appeared in FC’23, but this is a significant rewrite).

This is joint work with Tim Roughgarden.

This journal version of the paper subsumes the earlier conference versions “Byzantine Generals in the Permissionless Setting” and “Resource Pools and the CAP Theorem”, substantially revises the frameworks presented in those papers, and presents a number of new results.

The Consensus Canon

On April 2, 2023 (updated May 12, 2023) By lewis-pye

a16z asked me to produce a list of resources for those who want to get quickly up-to-date with the consensus literature. You can find it here.

Fever

On February 10, 2023 (updated January 24, 2024) By lewis-pye

View synchronisation is an important component of many modern Byzantine Fault Tolerant State Machine Replication (SMR) systems in the partial synchrony model. Roughly, the efficiency of view synchronisation is measured as the word complexity and latency required for moving from being synchronised in a view of one correct leader to being synchronised in the view of the next correct leader.

The efficiency of view synchronisation has emerged as a major bottleneck in the efficiency of SMR systems as a whole. A key question remained open: Do there exist view synchronisation protocols with asymptotically optimal quadratic worst-case word complexity that also obtain linear message complexity and responsiveness when moving between consecutive correct leaders?

We answer this question affirmatively with a new view synchronisation protocol for partial synchrony assuming minimal clock synchronisation, called Fever. If $n$ is the number of processors and $t$ is the largest integer $<n/3$, then Fever has resilience $t$, and in all executions with $f$ Byzantine parties, where $0\leq f\leq t$, and network delays of at most $\delta \leq \Delta$ after GST (where $f$ and $\delta$ are unknown), Fever has worst-case word complexity $O(fn+n)$ and worst-case latency $O(\Delta f + \delta)$.
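As a worked illustration of these bounds, the sketch below computes the resilience $t$ (the largest integer strictly less than $n/3$) and evaluates the expressions inside the $O(\cdot)$ bounds with every hidden constant set to 1. That choice of constants is an assumption made purely for illustration, and the function names are ours.

```python
def resilience(n: int) -> int:
    """Largest integer t with t < n/3, i.e. ceil(n/3) - 1."""
    return (n - 1) // 3


def word_complexity_shape(n: int, f: int) -> int:
    """The expression f*n + n from the O(fn + n) bound, constants dropped."""
    return f * n + n


def latency_shape(f: int, big_delta_ms: int, small_delta_ms: int) -> int:
    """The expression Delta*f + delta from the O(Delta f + delta) bound, in ms."""
    return big_delta_ms * f + small_delta_ms


if __name__ == "__main__":
    n = 100
    t = resilience(n)
    # Fault-free execution versus the maximum number of tolerated faults.
    for f in (0, t):
        words = word_complexity_shape(n, f)
        latency = latency_shape(f, big_delta_ms=1000, small_delta_ms=50)
        print(f"n={n}, f={f}: ~{words} words, ~{latency} ms")
```

With $f = 0$ the expressions reduce to $n$ words and $\delta$ latency, which is the linear and responsive behaviour between consecutive correct leaders. With $f = t$ the word count is on the order of $n^2$, matching the quadratic worst case.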

Fever: pdf (OPODIS 2023)

This is joint work with Ittai Abraham.

Flash: An Asynchronous Payment System with Good-Case Linear Communication Complexity

On January 5, 2023 (updated January 24, 2024) By lewis-pye

While the original purpose of blockchains was to realize a payment system, it has been shown that, in fact, such systems do not require consensus and can be implemented deterministically in asynchronous networks. State-of-the-art payment systems employ Reliable Broadcast to disseminate payments and prevent double spending, which entails $O(n^2)$ communication complexity per payment even if Byzantine behavior is scarce or non-existent.
Here we present Flash, the first payment system to achieve $O(n)$ communication complexity per payment in the good case and $O(n^2)$ complexity in the worst case, matching the lower bound. This is made possible by sidestepping Reliable Broadcast and instead using the blocklace (a DAG-like partially-ordered generalization of the blockchain) for the tasks of recording transaction dependencies, block dissemination, and equivocation exclusion, which in turn prevents double spending.
Flash has two variants: for high congestion when multiple blocks that contain multiple payments are issued concurrently; and for low congestion when payments are infrequent.
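To see where the quadratic cost of Reliable Broadcast comes from, the sketch below counts messages in a Bracha-style broadcast (one send step followed by all-to-all echo and ready steps) and compares it with a linear count. This does not model Flash itself: the linear function is a hypothetical stand-in for "a constant number of messages per node", and both function names are ours.

```python
def bracha_rb_messages(n: int) -> int:
    """Messages in one Bracha-style reliable broadcast among n nodes.

    Messages a node sends to itself are counted, for simplicity.
    """
    send = n        # the sender forwards the payload to every node
    echo = n * n    # every node echoes to every node
    ready = n * n   # every node sends a ready message to every node
    return send + echo + ready


def linear_dissemination_messages(n: int, per_node: int = 1) -> int:
    """Hypothetical good-case cost: a constant number of messages per node."""
    return per_node * n


if __name__ == "__main__":
    for n in (4, 100):
        quadratic = bracha_rb_messages(n)
        linear = linear_dissemination_messages(n)
        print(f"n={n}: reliable broadcast {quadratic} messages, linear {linear}")
```

The gap grows with $n$: the all-to-all steps dominate, so the ratio between the two counts is itself linear in $n$.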

Flash: An Asynchronous Payment System with Good-Case Linear Communication Complexity, pdf.

This is joint work with Ehud Shapiro and Oded Naor.

Andrew Lewis-Pye

Author: lewis-pye