Working through Gossip Glomers in Racket
Gossip Glomers is a series of distributed systems programming challenges from Fly.io. It uses Maelstrom, a platform for describing test workloads that can run your programs as distributed systems nodes. Maelstrom workloads can provide inputs to these nodes (as if they are arriving over a network), inject delays and partitions and then check that your system still satisfies the invariants of each challenge.
So far I have made it through 3 of the 6 challenges1. I used Racket for all the challenges. I have released a Maelstrom wrapper for Racket as a library you can use if you would like to attempt the challenges in Racket. It provides some nice abstractions to implement a node.
Warm up
Challenges #1 and #2 are pretty straightforward if you are comfortable programming, since they are a warm up. Challenge #1 is a simple Echo node, that just gets you familiar with writing a node. In my case, that was several hours worth of effort as I was implementing the library from scratch and simultaneously getting familiar with Racket’s concurrency and I/O primitives.
Challenge #2 requires generating unique IDs. For this, a UUID would also work, but I did another simple thing. Each node knows its ID, and the node ID is unique. So each node just maintains a counter and the resulting ID is a concatenation of the node ID and the counter.
Since each generate
message’s callback can run in a separate (green-)thread, the counter is wrapped in a box
that does atomic compare and swap to increment it.
Gossiping
Challenge #3 is where things start to become fun. The challenge deals with both network partitions and the later sub-challenges require certain latency bounds in the presence of delays. I built up a set of solutions to this.
For #3a, nodes do not communicate with each other. So this was simple. I used a semaphore to guard the list of known values.
In #3b, peers gossip by forwarding broadcast
messages to other known peers (except the sender). To achieve completion, only new messages are forwarded. This is fine when messages are guaranteed to be delivered, and there are only 5 nodes in the network, so performance isn’t a concern.
Racket’s concurrency (not parallelism) model is based on green threads, Go-like channels, and ConcurrentML based selection between multiple concurrent events. Nearly all the basic I/O and messaging primitives (threads, channels, ports and others) can act as selectable events. Threads and channels are very cheap to spawn. Idiomatic Racket practice is to have individual threads manage state, and other threads querying or mutating the state through messages on channels. Not only does this avoid shared state, it lets programs be kill-safe 2.
For #3c, I took the more idiomatic approach of having a dedicated thread for maintaining the list of seen values. In addition,
network partitions come into play in #3c. With that comes the need for acknowledgements and retries. This is where the rpc
function comes in handy. For every new value received, the node keeps a set of peers that still haven’t acknowledged receiving the new value. It then sends an rpc
message with the new value to each of them. When a response is received (the partition has healed), the peer is removed from the set. The loop is retried until all peers have acknowledged receipt of the message.
For 3c, I also read the topology
message, instead of relying on known-peers
. The intent was to practice reading this message as it would be required in future sub-challenges. I must admit, I still haven’t understood the purpose of topologies, since it seems all nodes can always communicate with each other in the broadcast challenge. In hindsight, I think all of Challenge #3 can be solved while ignoring the topology
message completely.
The solution in 3c works, but is very inefficient, as one message is sent per new value, and we keep sending these messages until acknowledged. This means the number of messages that are being sent scales linearly with how long the partition remains active.
Challenges #3d and #3e require solving the same problem more efficiently. #3d has some msgs/op and latency bounds. #3e decreases the msgs/op limit, but increases the latency bound. These latency bounds only apply without partitions. The system can be more inefficient during partitions, but it must still eventually converge. In my approach, but challenges use the same code, with the retry timeout tweaked for #3e.
For #3d, I introduced a new message type broadcast_multi
that is only sent between nodes (instead of being sent by Maelstrom like broadcast
). This sends a list of multiple values. Each node spawns one thread for each peer it will forward to. This thread maintains 2 sets:
- The set of all messages known by this node. This can be shared among all the peer managers, but it isn’t right now.
- The set of all messages acknowledged as received by the peer.
And spins on 3 events:
- New values arrive at this node. In this case, the first set is extended.
- A timeout triggers, which starts another attempt at sending the peer the list of all values it has not yet acknowledged. This is the difference between set 1 and set 2.
- The RPC receives an acknowledgement. In this case the second set is extended with the values we know were sent in that RPC.
This makes the msgs/op drop dramatically as outgoing values are batched and only sent a few times a second. The timeout can be tweaked based on latency requirements.
My peer selection approach here is flawed but works in the limited attempts so far. I select a random subset of n/2 + 1 peers. However it is possible for random selection to go wrong here and leave some nodes disconnected from others. I may get around to revising it. A quick look at some Gossip protocols literature suggests using random peer selection per message, but keeping a track of known values for all nodes may be reasonable for this implementation. Selecting new peers per message should ensure that some nodes are not permanently ostracized.
I have yet to start on Challenge #4. Gossip Glomers is a really fun exercise, and I’ve had a great time doing it in Racket. ConcurrentML clearly has a great approach to concurrency, and I have barely scratched the surface. Things like nack-guard-evt
for cancellation is something I need to look at more. I’ll be writing a separate article on concurrency in Racket once I have more experience and things to say.