# Data Availability and Sharding

How to use the slides - Full screen (new tab)

**Slides Content**

# Data Availability and Sharding

### Outline

## Data Availability Problem

How do we ensure a piece of data is retrievable without storing it on every single node forever?

Incorrectness can be proven (fraud proofs), but unavailability can't.

---v

### Data Availability Problem: Parachains

Imagine a parachain collator produces a block, but only sends it to relay chain validators to verify.

What could such a collator do?

- Prevent nodes and users from learning the parachain state
- Prevent other collators from being able to create blocks

We want other collators to be able to reconstruct the block from the relay chain.

---v

### Data Availability Problem: Relay Chain

If that block's PoV is only stored by a few validators, what if they go offline or rogue?

- Honest approval-checkers are not able to verify validity

Notes:

This is really bad. It means we could finalize an invalid parachain block.

## Problem

Notes:

I really like this comic from the paradigm article about Data Availability Sampling. But it works for our case as well with data sharding.

## Erasure coding

The goal:

- Encode data of K chunks into a larger encoded data of N chunks
- Any K-subset of N chunks can be used to recover the data

---v

### In code

`#![allow(unused)] fn main() { type Data = Vec<u8>; pub struct Chunk { pub index: usize, pub bytes: Vec<u8>, } pub fn encode(_input: &Data) -> Vec<Chunk> { todo!() } pub fn reconstruct(_chunks: impl Iterator<Item = Chunk>) -> Result<Data, Error> { todo!() } }`

### Polynomials

---v

### Polynomials: Line

---v

### Even More Polynomials

---v

### Polynomial we need

We want to have a polynomial, such that:

$$ p(x_i) = y_i$$

Notes:

Question: what is x_i and y_i wrt to our data?

### Lagrange interpolating polynomial

$$ \ell_j(x) = \frac{(x-x_0)}{(x_j-x_0)} \cdots \frac{(x-x_{j-1})}{(x_j-x_{j - 1})} \frac{(x-x_{j+1})}{(x_j-x_{j+1})} \cdots \frac{(x-x_k)}{(x_j-x_k)} $$

$$ L(x) = \sum_{j=0}^{k} y_j \ell_j(x) $$

### Reed-Solomon codes

Congrats! You've just learned Reed-Solomon encoding (almost).

Actual Reed-Solomon codes are defined over finite-fields.

It can detect and correct combinations of errors and erasures.

Notes:

The simplest example of a finite field is arithmetic mod prime number. Computers are quite bad at division by prime numbers. Reed-Solomon codes are used in CDs, DVDs, QR codes and RAID 6.

---v

### Reed-Solomon with Lagrange interpolation

- Divide the data into elements of size $P$ bits.
- Interpret the bytes as (big) numbers $\mod P$.
- Index of each element is $x_i$ and the element itself is $y_i$.
- Construct the interpolating polynomial $p(x)$ and evaluate it at additional $n - k$ points.
- The encoding is $(y_0, ..., y_{k-1}, p(k), ... p(n - 1))$ along with indices.

Notes:

How do we do reconstruction?

## Polkadot's Data Availability Protocol

- Each PoV is divided into $N_{validator}$ chunks
- Validator with index i gets a chunk with the same index
- Validators sign statements when they receive their chunk
- Once we have $\frac{2}{3} + 1$ of signed statements,

PoV is considered available - Any subset of $\frac{1}{3} + 1$ of chunks can recover the data

Notes:

The total amount of data stored by all validators is PoV * 3. With 5MB PoV and 1k validators, each validator only stores 15KB per PoV. With this protocol, we've killed two birds with one stone!

### CandidateIncluded

### Availability Bitfields

Notes:

Each validator actually signs a statement per relay chain block, not per PoV to reduce the number of signatures. These statements are gossiped off-chain and included in a block in a ParachainsInherent.

### Challenge 1

How does a validator know if a chunk corresponds to the committed data?

---v

### Not that Merkle!

### Challenge 2

How do we know if what can be reconstructed from chunks is the same data that was encoded with Reed-Solomon?

- Polkadot uses approval-voting/disputes mechanism for that
- Celestia uses Fraud Proofs
- Danksharding uses KZG commitments

## Data Availability Sampling

Ethereum (Danksharding) and Celestia adopt an approach of Data Availability Sampling, where each light client makes its own judgement of availability by sampling and distributing a few random chunks.

This can eliminate honest majority assumption!

This approach guarantees there's at least one honest full nodes that has the data with high probability.

## Safety of Polkadot's protocol

If we have at most $f$ out of $3f + 1$ malicious + offline validators, then if the data is marked as available, it **can** be recovered.

What if that assumption is broken?

If $2f + 1$ are malicious, every PoS is doomed anyway.

Notes:

We'll see in the next lesson, how approval-voting can prevent unavailable blocks from being finalized even with $>f$ malicious nodes.

## 2D Reed-Solomon Encoding

Notes:

The approach of 2D Reed-Solomon Encoding can reduce the size of a Fraud Proof used by Celestia. But it adds an overhead computationally and on the amount of data stored.

## Comparison with other approaches

- Both Danksharding and Celestia use 2D encoding and DAS
- Celestia doesn't implement data sharding
- Data availability is only part of ensuring validity
- Polkadot's DA is able to process dozens of MB per second

Notes:

Danksharding is aiming at 1.3 MB/s and Celestia < 1 MB/s.

## Ideas to Explore

- Data Availability Sampling for parachain

light clients and full nodes - Consider using KZG commitments
- Reducing the number of signatures to verify
- A Data Availability Parachain

# Questions

### Bonus

- Polkadot uses a field of size $2^{16}$ with efficient arithmetic
- Polkadot uses an FFT-based Reed-Solomon algorithm (no Lagrange)