Deferred Validation: Safer State-Dependent Writes in Distributed Systems
In distributed systems using consensus protocols like Raft, one of the trickiest challenges is handling operations that depend on the current state, such as INCR, in a safe and deterministic way.
In a previous post, we explained how we transformed non-idempotent operations (like INCR) into idempotent ones by performing a read-before-write. While that worked for idempotency, we later realized it introduced subtle race conditions.
This post explains how we addressed those issues by deferring validation until the log entry is applied to the state machine.
The Problem with Pre-Log Validation
Previously, we validated commands before writing them to the Raft log. For example, when handling an INCR request, we would:
- Read the current value of the key
- Compute the incremented value
- Transform the request into a SET command
- Write the SET to the log
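The steps above can be sketched roughly as follows. This is a minimal illustration, not our actual code: the `Value` and `Command` types, the `prepare_for_log` function, and the use of a plain `HashMap` as the state are assumptions made for the example, as is treating a missing key as 0.

```rust
use std::collections::HashMap;

/// Values the store can hold (hypothetical type for this sketch).
#[derive(Clone, Debug, PartialEq)]
enum Value {
    Number(i64),
    Text(String),
}

/// Commands as they arrive from clients (hypothetical type).
#[derive(Clone, Debug, PartialEq)]
enum Command {
    Incr(String),
    Set(String, Value),
}

/// Pre-log validation: read the current value, compute the result,
/// and rewrite INCR into a SET before anything reaches the log.
fn prepare_for_log(state: &HashMap<String, Value>, cmd: Command) -> Result<Command, String> {
    match cmd {
        Command::Incr(key) => {
            let next = match state.get(&key) {
                Some(Value::Number(n)) => n + 1,
                Some(_) => return Err("Cannot increment non-numeric value".to_string()),
                // Assumption: a missing key is treated as 0.
                None => 1,
            };
            Ok(Command::Set(key, Value::Number(next)))
        }
        // SET does not depend on the current state, so it is logged as-is.
        set @ Command::Set(..) => Ok(set),
    }
}
```

The important detail is that the value written to the log is fixed at the moment of the read, not at the moment the entry is applied.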
This made retries safe and the operation idempotent. However, it introduced correctness problems under concurrency.
Race Condition Example
Imagine three requests arriving around the same time:
- SET x 1
- INCR x (reads 1, plans to set 2)
- SET x "oops"
If these commands are validated before logging, each one appears valid in isolation. But by the time an entry is applied, the actual state may have changed. For example, the INCR that was validated against the value 1 could end up being applied after x has become a non-numeric string.
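To make the failure concrete, here is one possible interleaving replayed step by step. It is a sketch under stated assumptions: the `Value` type and the `replay_stale_incr` function are hypothetical, the state is a plain `HashMap`, and we assume SET x "oops" happens to be applied ahead of the rewritten INCR.

```rust
use std::collections::HashMap;

#[derive(Clone, Debug, PartialEq)]
enum Value {
    Number(i64),
    Text(String),
}

/// Replays one hypothetical interleaving under pre-log validation and
/// returns the final value of `x`.
fn replay_stale_incr() -> Option<Value> {
    let mut state: HashMap<String, Value> = HashMap::new();

    // 1. `SET x 1` is committed and applied.
    state.insert("x".to_string(), Value::Number(1));

    // 2. `INCR x` is validated now: it reads 1 and is rewritten to `SET x 2`,
    //    but the rewritten command has not been applied yet.
    let planned = match state.get("x") {
        Some(Value::Number(n)) => Value::Number(n + 1),
        _ => return None, // not reached in this scenario
    };

    // 3. `SET x "oops"` wins the race and is applied first.
    state.insert("x".to_string(), Value::Text("oops".to_string()));

    // 4. The stale `SET x 2` is applied; nothing re-checks that x is numeric.
    state.insert("x".to_string(), planned);

    state.get("x").cloned()
}
```

In this interleaving the increment was computed from a value that no longer exists when the entry is applied, and the type check that passed during validation is never repeated.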
The Fix: Validate at Apply-Time
We changed our system to record the original request in the Raft log without performing any pre-log validation. Then, validation happens only when the log entry is applied to the state machine.
    // `state` is passed in explicitly; `State` stands for the key-value store type.
    fn apply_command(state: &mut State, cmd: Command) -> Result<(), Error> {
        match cmd {
            Command::Incr(key) => {
                // Validate against the state as it is at apply time.
                let next = match state.get(&key) {
                    Some(Value::Number(n)) => n + 1,
                    Some(_) => return Err("Cannot increment non-numeric value".into()),
                    // A missing key counts as 0, so the first INCR stores 1.
                    None => 1,
                };
                state.set(key, Value::Number(next));
            }
            Command::Set(key, value) => {
                state.set(key, value);
            }
        }
        Ok(())
    }
By deferring validation, the command sees the actual current state, avoiding stale reads and making operations safer.
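As a self-contained illustration of the same idea, the sketch below replays a whole log and collects one result per entry. The `Store` type, its `apply` method, and the `replay` helper are hypothetical stand-ins for the real state machine, and errors are plain strings for brevity.

```rust
use std::collections::HashMap;

#[derive(Clone, Debug, PartialEq)]
enum Value {
    Number(i64),
    Text(String),
}

#[derive(Clone, Debug)]
enum Command {
    Incr(String),
    Set(String, Value),
}

/// Hypothetical in-memory state machine.
#[derive(Default)]
struct Store {
    data: HashMap<String, Value>,
}

impl Store {
    /// Validates and applies one command against the current state.
    fn apply(&mut self, cmd: Command) -> Result<(), String> {
        match cmd {
            Command::Incr(key) => {
                let next = match self.data.get(&key) {
                    Some(Value::Number(n)) => n + 1,
                    Some(_) => return Err("Cannot increment non-numeric value".to_string()),
                    None => 1,
                };
                self.data.insert(key, Value::Number(next));
            }
            Command::Set(key, value) => {
                self.data.insert(key, value);
            }
        }
        Ok(())
    }
}

/// Applies a log in order and returns the final store plus one result per entry.
fn replay(log: Vec<Command>) -> (Store, Vec<Result<(), String>>) {
    let mut store = Store::default();
    let results = log.into_iter().map(|cmd| store.apply(cmd)).collect();
    (store, results)
}
```

If the log orders SET x "oops" ahead of INCR x, the INCR entry is rejected when it is applied and x keeps the string value. With the order SET x 1, INCR x, SET x "oops", all three entries succeed.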
Why This Works Better
- Eliminates race conditions: Operations are validated against the real state.
- Simplifies the Raft log: We log intent, not precomputed results.
- Keeps clients simple: No need for clients to read or calculate anything.
It’s a cleaner model: logs record what the client wants to do, and the state machine determines whether it's valid.
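A log entry that carries intent might look something like the sketch below. The `LogEntry` layout, the `append` helper, and the in-memory `Vec` log are assumptions for illustration only; a real Raft implementation also handles persistence and replication, which are omitted here.

```rust
/// Client intent, logged verbatim (hypothetical types for this sketch).
#[derive(Clone, Debug, PartialEq)]
enum Command {
    Incr(String),
    Set(String, String),
}

/// A log entry carrying the untouched client command.
#[derive(Clone, Debug, PartialEq)]
struct LogEntry {
    index: u64,
    term: u64,
    command: Command,
}

/// Appends a command to an in-memory log without reading any state.
/// Returns the index assigned to the new entry.
fn append(log: &mut Vec<LogEntry>, term: u64, command: Command) -> u64 {
    let index = log.len() as u64 + 1; // Raft log indices start at 1
    log.push(LogEntry { index, term, command });
    index
}
```

Note that `append` never looks at the key-value state: an INCR goes into the log as an INCR, and nothing is computed until the entry is applied.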
Trade-Offs
- Delayed error reporting: Invalid operations (e.g., INCR on a string) are only rejected when applied.
- More logic in the state machine: It now handles all validation.
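One way to deal with delayed error reporting is to have the node that accepted the request remember who is waiting on each log index and answer once the entry has been applied. The sketch below is an assumption about how this could be wired up, not a description of our implementation: `PendingReplies`, `register`, and `complete` are hypothetical names, and a standard-library channel stands in for the real client connection.

```rust
use std::collections::HashMap;
use std::sync::mpsc::{channel, Receiver, Sender};

/// Outcome of applying one log entry (errors are plain strings in this sketch).
type ApplyResult = Result<(), String>;

/// Tracks clients waiting for the outcome of a log entry, keyed by log index.
#[derive(Default)]
struct PendingReplies {
    waiters: HashMap<u64, Sender<ApplyResult>>,
}

impl PendingReplies {
    /// Called when an entry is appended: the request handler keeps the receiver.
    fn register(&mut self, index: u64) -> Receiver<ApplyResult> {
        let (tx, rx) = channel();
        self.waiters.insert(index, tx);
        rx
    }

    /// Called after the state machine applies the entry at `index`.
    fn complete(&mut self, index: u64, result: ApplyResult) {
        if let Some(tx) = self.waiters.remove(&index) {
            // The client may have gone away; a failed send is ignored.
            let _ = tx.send(result);
        }
    }
}
```

With something like this in place, the client still gets a definite success or error for its INCR; it simply arrives after the entry has been committed and applied rather than before it is logged.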
Still, the benefits in correctness and concurrency safety outweigh the downsides.
Conclusion
State-dependent operations like INCR can’t safely rely on pre-log validation. By moving validation to the state machine apply phase, we eliminate race conditions and make our distributed system more robust and predictable.
From now on, our Raft logs carry only the user’s original intent. The state machine is the sole authority on whether a command is valid—based on the real state at the moment of application.