Whitepaper

Abstract

Snapchain is a blockchain-like network for storing and syncing Farcaster's social data. It has stronger and faster consistency guarantees than the current deltagraph system which is finding it hard to keep all the nodes in sync in near real-time. The tradeoff we make for the consistency improvements is a new consensus step that introduces more complexity and failure modes which must be addressed.

Problem

A decentralized social network is one where two users can find each other and communicate, even under adverse conditions. Users must be able to run a node and use it to communicate with each other. Each node must reach consensus about a user’s state and stay in sync with other nodes. If Alice follows Bob at one node, it must make sure that she wasn’t already following Bob and then update this relationship on every other node.

Users generate a lot of transactions and expect real-time delivery. Twitter, for example, has 200M daily users and sees 10k TPS and is likely to see 1TB - 10TB/day in state growth. Existing decentralized networks can’t handle this kind of load with real-time delivery. It’s not because it’s impossible, but because they make tradeoffs to solve different user problems. Blockchains move money and are designed to prevent double spends, which makes sharding and pruning data difficult. Federated systems like email are shard-able but have weak decentralization and consistency, which makes apps harder to build. See Appendix D for more details.

Farcaster has used a CRDT-based system called a deltagraph to decentralize social data. By defining every transaction as a CRDT operation, consensus is reached immediately without coordination at the local node. The changes are then gossiped out to peers which can lazily update their own state. The network served 100k users doing 500 TPS with 2GB/day state growth in early 2024.

As the network grew to thousands of nodes, some of them get out of sync due to gossip failures. Since CRDTs are unordered, a node could only detect gossip failures by syncing manually with every other node and comparing all transactions. This becomes slow and eventually infeasible as the number of nodes and valid messages cross some threshold (see Appendix C). The lack of ordering also meant that the network cannot enforce global rate limits, and they must be localized to each node. The side effect of this is that a transaction that passes the limits on one node might be rejected by another. Without strict ordering, it's hard to guarantee both real-time delivery and strong consistency.

Specification

Snapchain introduces transaction ordering and blockchain-like semantics to Farcaster. A block production step is added which groups and orders user transactions. Syncing is much simpler since a node only needs to find and download missing blocks. Snapchain, like the deltagraph before it, relies on an external blockchain to handle account creation and fee collection.

Snapchain is different from most blockchains because its transactions are not turing complete, are account independent and pruned often. A transaction is a "post", "like" or other social operation which only affects a single account. This is important for scaling since it prevents the network from being used for non-social purposes and makes sharding by account easy. Older transactions are pruned to clear data from inactive users or negating transactions, such as when a user likes and unlikes the something.

The initial release of snapchain should support a TPS of > 9000 which would support 2 million daily users.

1. Accounts

Users create and manage accounts using an external blockchain. This incurs some fees during setup but is necessary for the strong security and consistency guarantees. Calling the registry contract onchain issues a unique account number or farcaster id to the wallet. Signed messages from this wallet are treated by Snapchain as authorized actions from the account. Accounts can be transferred between wallets at any time, though an address may only own one account at a time.

Accounts can acquire human-friendly ENS usernames by proving ownership with an onchain or offchain proof. All references to the account are made to the farcaster id, which in turn is mapped to the verified ENS username by clients. This lets users change their username without having to resign all data on the network. This system can also be extended to non-ENS name systems if desired.

Accounts can issue "app keys" onchain which are keys with a narrower set of permissions. They can post messages on behalf of the account but cannot change ownership of the account or modify other app keys. They are used like auth tokens to delegate permissions to clients safely. It may be possible to implement app keys on Snapchain in future, avoiding onchain fees for modifying them.

Account recovery is built into registry contract which lets the wallet nominate another address which also controls the farcaster id. This could be set to the user's primary wallet, an m-of-n social recovery multisig or institutional recovery wallet. User may also compose their own recovery systems by converting the wallet into a smart wallet which can implement custom recovery logic.

2. Transactions

A blockchain transaction is a Farcaster specific transactions that happens on an external blockchain. An example is when Alice makes a transaction to the registry contract to get her farcaster id and set up her app keys. Snapchain nodes listen to and store blockchain transactions in their history.

A snapchain transaction is a social action like making a new post. Alice says “Hello World” by making an add-post transaction, signing it with her app key and broadcasting it. Nodes verify that every transaction is correctly signed according to the specification. Common actions like deleting posts or following other users have their own transaction types. Snapchain transactions are self-authenticating and anyone trace the authenticity from the message to the app key to the wallet to the farcaster id.

3. Account State

An account comes into existence when a blockchain transaction is made to create a new account in the registry. It's state is simply the set of blockchain and snapchain transactions that it generates. A deterministic state root can be computed by putting all the transaction ids into a merkle trie. Transactions made by one account cannot affect the state of another account. Enforcing this restriction makes Snapchain more scalable since account-level sharding becomes trivial to implement.

When a new transaction is accepted, it may be added to the state or it may replace a previous transaction in the state or it may delete a previous transaction entirely.In the example below, we see Alice’s account state changing as she creates an account, adds a post and then deletes it.

Formal definition: There exists a state (S) for an account (A) made up of transactions. S is a subset of all transactions made by a user (S ⊆ Ta). A merge function M accepts an S and t and returns a new state S’ ( M

×t→S′). Each T is idempotent but not associative or commutative.

4. Blocks

Snapchain and blockchain transactions are sequenced into blocks. A block must have a signature from the block producer, a link to the previous block and a global state root. The global state root is the root of the global state trie, whose leaves are the roots of each account state trie. If the state of any account changes, the global state root also changes.

Blocks are produced by a committee of block validators and tendermint is used to reach consensus. A leader is chosen to produce the block and at least two-thirds of other validators must sign off. Snapchain is byzantine tolerant and up to one-third of the network can be malicious without affecting block production. Validators are selected through a voting committee which is described in Appendix A.

Blocks are grouped into epochs that are K blocks in length. A special epoch block is published at the beginning which contains additional metadata used to re-configure chain parameters. These blocks must be preserved forever and cannot be pruned. One example of epoch metadata is the leader rotation schedule. Leaders must be rotated periodically or if they fail to produce a block. The schedule for the next K blocks is determined using a deterministic function and included in every epoch block.

Nodes get new blocks from their peers and update account states. After a week, non-epoch blocks can be pruned by nodes to free up disk space. Pruning permanently removes deleted posts and likes which is desirable feature for users. The week’s delay ensures that nodes that go offline even for a few days can catch up by streaming blocks from their peers.

Nodes that go offline for long periods (or that start from scratch) must use snapshot sync instead. The protocol will publish daily snapshots of the global state to a file server as a public good. The snapshot is tamper-proof since modifying transactions will invalidate block signatures and omitting transactions will invalidate the global root. Nodes can download the state snapshot and then stream blocks from their peers to catch up.

5. State Rent

Decentralized networks can be flooded with transactions which consume disk space, bandwidth and compute. Blockchains control this by imposing a per-transaction fee, but this isn’t great for a social network. If users have to worry about fees for each post, they will post less frequently which is bad for the network.

Snapchain gives users practically unlimited transactions if they pay a yearly fee. Users must rent a storage unit on the external blockchain after creating an account. Each unit gives them a rate limit (500 tx/hour) and a storage limit (10,000 txns) for their account state. Users can buy multiple units to increase these limits but in practice 99% of users rarely need more than one.

Usage feels “unlimited” because when storage limits are exceeded a user’s oldest transaction is discarded instead of preventing the newer transaction from confirming. Each transaction type (post, like, follow) has its own set of limits and a newer post will push out the oldest post. This generally works well in a timeline based social network because older posts are rarely revisited and most users are comfortable with the ephemeral behavior. Those who want more permanence can pay for additional storage units or archive data elsewhere.

The benefits of this system are that users don’t really have to think about storage and can just keep using the network. One downside is that a single storage unit must have separate, fixed limits for each types and users with different usage patterns may feel that they are wasting storage. Another downside is that while expiring the oldest message is a reasonable decision for posts, it may not be the right tradeoff for something like a follow. Apps may need to implement safeguards to protect users from blowing away certain historical data when limits are exceeded.

6. Sharding

Snapchain can be sharded into N segments using N+1 tendermint chains to improve scalability. Accounts are assigned to a chain using a deterministic function. In the example below, odd numbered accounts are assigned to one shard and even ones to the other. The 0th chain is used to unite all the shards so that they appear as a single chain. Our approach to sharding is inspired by NEAR’s Nightshade.

A shard chain must have at least three validators and store all relevant account state. Validators may be automatically or manually rotated between shards through a validator schedule in the epoch block. Erasure coding is used to distribute account state from one shard across validators in other shards so that the data is still available even if all validators within a shard fail.

Block production is triggered when the previous block is finalized. Each shard chain bundles transactions into a block and computes a shard root, which is like the global root but limited to accounts in a shard. The 0th shard chain waits for the N shards to be produced and then performs another tendermint step bundling them into a single block and computes a global state root across the shard roots.

7. Sync

Nodes rely on gossip as the primary mechanism for p2p communication. When a block is produced, the block is gossiped out separately from the shards that compose it. Gossip failures are reasonably easy to recover from due to ordering. If a block is skipped, a sequence jump will be detected and the node is aware that they missed a block. All nodes will expose rpc endpoints which can be used to fetch older blocks.

Validators also rely on gossip to manage the mempool and for inter-validator communication when consensus is being reached on the state of a block. All the tendermint consensus steps happen via gossip messages.

8. Handling Failures

Validators can fail in a variety of ways and we must define how the network behaves in each scenario. Let’s start with the honest malfunctions:

Shard leader fails to produce a shard — after 1 second, consensus changes leadership according to the rotation. We can tolerate the failure of up to 1/3 of the validators.
A shard is not produced in time for the block — block production continues. If they fail to produce a shard for an entire epoch, the chain is halted.
A block is not produced — after 1 seconds, consensus changes leadership according to the rotation. We can tolerate the failure of up to 1/3 of the block leaders.

If nodes are behaving maliciously, there are more attack scenarios that are possible:

Block leader excludes shards or halts production — mitigated by rotating leaders, but governance action is needed to evict them permanently and solve the issue.
Shard leader excludes a user’s transactions — mitigated by rotating shard leaders, but governance action is needed to evict them * permanently and solve the issue.
Shard validator majority excludes a user’s transactions — if more than 2/3rd of a validators shards collude they can censor a user, and governance action is needed to resolve.
Block validator majority excludes a shard — if more than 2/3rd of block validators collude they can censor a shard, and governance action is needed to resolve.
Shard validator majority can reissue a shard before block finality — TBD are malicious, they can reissue a shard for a block before it gets finalized.
If > 2/3 majority of block validators and > 2/3 majority of one shard validators collude, they can reissue a block which would cause a network fork. Requires a refork and restart of the network,

9. Upgrade process

New node versions are released frequently, and nodes are expected to be kept upto date. There are two kinds of version upgrades:

Non-consensus breaking upgrades

These are the most common kind, usually containing bug fixes or performance improvements. They are backwards compatible and there is no need to coordinate with other nodes. Nodes can be upgraded at any time and will continue to work with older nodes. These changes are denoted with a patch version bump (e.g. 0.1.0 -> 0.1.1).

Consensus breaking upgrades

These are less common and usually contain breaking changes to the protocol. The changes are not backwards compatible, and a node may halt if it's not upgraded. These changes are denoted with a minor version bump (e.g. 0.1.0 -> 0.2.0).

Each block contains a version number for the protocol in its header. When a consensus breaking change is required, a new minor version is released with a PROTOCOL_VERSION bump and a timestamp after which it will take effect. All nodes must upgrade to the latest minor version before this time. All blocks produced on or after this timestamp will have this newer version. Nodes will not accept blocks with an unexpected version number and old nodes will detect they are out of date and self terminate.

Accidental breaking changes

It's also possible a bug or non-deterministic behaviour causes a breaking changes. E.g. a consensus breaking change is made without a corresponding protocol version bump. In this case, the nodes will proceed as normal until they encounter a blocks with the breaking change. At which point, the merkle roots will not match and the nodes that are not upgraded will halt. If this happens to validators, then block production will halt until a new version is released with a fix.

Frequently Asked Questions

What exactly is hard about sync today?

A question that’s come up a few times about Snapchain is some variant of “why is syncing hard today?”

There is no source of truth to sync from - Messages can be added or removed from any node at any point in history due to the eventually consistent nature of CRDTs. Changes are gossiped out when they happen, but this could fail for a variety of reasons. The only way for a node to catch up 100% is to 1) sync with every other node and compare every message and 2) prevent messages from entering the network until this is completed. There are 4000 nodes x 150 million messages today with 100s of messages changing every second making this impossible.
Rate limits cause nodes to diverge — rate limits are important to protect the network since we do not charge transaction fees. global rate limiting is impossible with crdts, so they are implemented per node. It is possible for a message to be temporarily rejected from a hub due to rate limits, but accepted by others.
Pruning complicates things — pruning means that when one message is received another, older message might be removed. this means that older state is constantly being modified by newer messages so its hard to be efficient about comparing message ids and hard to reason about why two nodes diverge.
Unidirectional sync is slow. A node can be “ahead” of another node for some accounts state and “behind” it for another account. In order for these nodes to get into sync, both of them must pull data from the other node (bidirectional sync) before any state change happens. In practice, this is challenging to implementing and we rely on unidirectional sync which means that only some state converges.

One class of solutions was “partial ordering” — the basic idea was that we would chain messages by having each message reference the previous one. The chains would either be per user or per app, instead of the total ordering that Snapchain proposes. The benefit of this approach is that we do not need a heavyweight consensus model since in the happy path each chain is typically only edited by one node at a time.

One way to think about this is that it reduces the sync space. Our nodes today must compare the total set of messages which is 150M items. If you can have a chain per user, that’s down to 1M items. If you have a chain per app that’s probably closer to just ~1000 unique items to compare per sync.

But there are still some unsolved problems:

Pruning is not possible — because there is a chain, we cannot easily prune older state because the tombstones are necessary for sync to function.
Rate limits are still hard — there’s no way to reach consensus across users or apps, so the limits would still be local and diverge.
Forking causes a lot of thrash — a user or app can “fork” their chain by introducing a conflicting message at some point in history. This would invalidate all future messages, which causes a lot of sync thrash and is an easy way to DDOS the network.
There is still no source of truth — a node still has to sync with every other node to converge because we are using CRDTs. We have reduced the search space from 4000 nodes * 150M items to 4000 nodes * 1000 app chains. But nodes will still be slightly out of sync with each other, and the problem will return as we add more nodes or items.
The migration path is messy — since messages need to be chained to other messages, we have to update older messages to this new format. but the problem is that messages are signed, and unless the user comes online with their key the message cannot be upgraded. we cannot ensure that users return, so we must either deprecate older data after some cutoff or keep both sync models built into hubs for a really long time.

Why not fork a blockchain instead of designing a new one?

An alternative to building snapchain would be to fork an existing blockchain to have similar properties. We would modify the VM so that the set of opcodes is limited to social actions and modify the transaction model to mirror snapchains rate-limit + pruning approach to metering usage. There are two challenges with this approach:

Sharding - given our tx volume and data size, we're going to need sharding soon. snapchains can be sharded by account easily because transactions are independent across accounts. blockchains have much more complicated sharding systems and we haven't found any that work in production yet. so there's a lot of implementation risk and unnecessary complexity.
Pruning - most chains we've looked at don't really have an easy way to bolt on pruning, or the ability to arbitrarily discard data from points in time cleanly. we would have to do a large refactor that touches most abstractions in the system.

Blockchains are doing a lot of work in both these areas and it is quite possible that in 2-3 years our POV on this has changed. But if we are making a decision today about the best solution for a 5 year time horizon, Snapchain seems like a better bet.

Why was tendermint chosen as the consensus algorithm?

It has been used in production systems for many years, has fast finality and good liveness guarantees. There are also well written implementations in Go and even one in Rust.

Will validators be able to censor users?

Censorship will be challenging with as few as ten globally distributed validators. There is no direct economic gain or loss caused by censorship. Users being censored can amplify their message via others and censorship is provable by observing transactions in the mempool. If all validators do collude, the voting committee described in Section A acts as a check and balance to change the validators set. If all the validators and voters collude, it may be possible to censor.

Should we take a different approach that makes censorship even harder?

It is possible to design even more decentralized forms of governance and block production to make censorship less practical. The argument against this is that censorship is already reasonably impractical and most of these designs come with great cost to system complexity or user experience which makes the network less likely to be useful. It is also important to remember that Snapchain has been upgraded in the past as requirements have changed, and can be upgraded again in the future if necesary.