Duva reconnects to peers automatically after reboot — and it feels really good
One of the little things we added to Duva, a Rust-powered distributed key-value store, is that when a node reboots, it tries to reconnect to the same peers it was talking to before.
It just works.
No re-bootstrap, no manual config tweaking, no weird “am I part of the cluster?” delays.
The node loads a small file with known peers and tries to reconnect as soon as it boots. If the file is older than a certain age (this threshold should be configurable), the node just skips it, since there's no point trying to gossip with ghosts.
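To make the staleness check concrete, here is a minimal sketch of how it could look. This is not Duva's actual code: the function name `load_fresh_peers`, the one-address-per-line file format, and the use of the file's modification time as its "age" are all assumptions made for illustration.

```rust
use std::fs;
use std::io;
use std::path::Path;
use std::time::{Duration, SystemTime};

/// Hypothetical helper: returns the saved peer addresses, or an empty list
/// if the file is missing or older than `max_age`.
/// Assumes one peer address per line; the real file format may differ.
fn load_fresh_peers(
    path: &Path,
    max_age: Duration,
    now: SystemTime,
) -> io::Result<Vec<String>> {
    let meta = match fs::metadata(path) {
        Ok(meta) => meta,
        // No file yet (first boot): nothing to reconnect to.
        Err(e) if e.kind() == io::ErrorKind::NotFound => return Ok(Vec::new()),
        Err(e) => return Err(e),
    };

    // `duration_since` fails if the file looks newer than `now`
    // (clock adjustments); treat that case as age zero.
    let age = now
        .duration_since(meta.modified()?)
        .unwrap_or(Duration::ZERO);

    if age > max_age {
        // Stale view of the cluster: skip it instead of dialing ghosts.
        return Ok(Vec::new());
    }

    let text = fs::read_to_string(path)?;
    Ok(text
        .lines()
        .map(str::trim)
        .filter(|line| !line.is_empty())
        .map(String::from)
        .collect())
}
```

Passing `now` in as a parameter, rather than calling `SystemTime::now()` inside, keeps the age check easy to test. A caller would pass the configured threshold as `max_age` and start dialing whatever addresses come back.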
It sounds simple, but I didn't realize how nice it would feel until I started rebooting nodes in dev/testing and seeing them instantly slide back into the cluster like nothing happened.
But it wasn’t completely painless
I ran into some interesting edge cases during this:
- How does a failed node know if it was a leader? If the node crashed while it was the leader, and it loads back up with a stale view, it might falsely assume it still leads the cluster. I had to make sure the node always re-validates its role by trying to talk to others before assuming leadership.
- Avoiding conflict with the `replicaof` command: Duva has a `replicaof` command that lets you point a node to follow a specific leader manually. But that creates tension: should the node obey `replicaof`, or reconnect to old peers from before the crash? I had to make sure the reboot reconnection logic respects `replicaof` if it’s set; basically, user intent wins over automation.
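Both edge cases boil down to a couple of small decisions at startup, which can be sketched roughly as below. The types and function names (`StartupAction`, `Role`, `decide_startup`, `role_after_reboot`) are hypothetical and not taken from Duva's codebase. Falling back to follower when no peer can confirm the leader is also an assumption of this sketch; a real cluster would likely settle that through its election mechanism.

```rust
/// Hypothetical startup plan for a rebooted node.
#[derive(Debug, PartialEq)]
enum StartupAction {
    /// The user set `replicaof`: follow that leader and ignore saved peers.
    FollowLeader(String),
    /// No explicit target: try the peers saved before the crash.
    ReconnectToPeers(Vec<String>),
    /// Nothing to go on: start as a fresh node.
    StartFresh,
}

#[derive(Debug, PartialEq, Clone, Copy)]
enum Role {
    Leader,
    Follower,
}

/// User intent wins over automation: `replicaof` is checked first.
fn decide_startup(replicaof: Option<String>, saved_peers: Vec<String>) -> StartupAction {
    if let Some(leader) = replicaof {
        return StartupAction::FollowLeader(leader);
    }
    if saved_peers.is_empty() {
        StartupAction::StartFresh
    } else {
        StartupAction::ReconnectToPeers(saved_peers)
    }
}

/// The role held before the crash is deliberately not an input here.
/// The node only acts as leader if reachable peers still name it as the
/// leader; otherwise (including when nobody answered) it stays a follower.
fn role_after_reboot(self_id: &str, leader_reported_by_peers: Option<&str>) -> Role {
    match leader_reported_by_peers {
        Some(id) if id == self_id => Role::Leader,
        _ => Role::Follower,
    }
}
```

Keeping the persisted role out of `role_after_reboot` entirely is the point: a stale "I was the leader" flag has no way to leak into the decision.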
These were fun challenges to debug, and we're pretty happy with how the final setup behaves. It keeps things simple but smart — nodes that go down come back up quickly, and fall back into their roles with minimal fuss.