Why CSP Matters II: How Do I Know Sync Works?

How does one know if their synchronization algorithm works? I have a real-time synced grocery list app. In the last post I attempted to scrape by on synchronization by implementing differential sync based on an intelligent-sounding paper. In the end my implementation had subtle bugs and was difficult to debug. I hoped to find a better approach to the problem using the CSP language.

One challenge is just to make the problem understandable. If the CSP language can represent the problem clearly and succinctly, it already makes it easier to manipulate the algorithm, discover problem cases, and implement it properly. But can CSP actually prove that synchronization works? And perhaps more importantly, is it worth the effort?

CSP and CSPM

CSP has a long and rich history in research and there are doubtless countless uses and resources of which I haven’t the foggiest. The books I found that I liked were Roscoe’s “The Theory and Practice of Concurrency” and Steve Schneider’s “Concurrent and Real Time Systems: the CSP approach”, in addition to The CSP Book from Hoare himself. But the important thing for me came from Roscoe’s book, which introduced me to the CSP

CSP is a mathematical language for representing processes that talk to each other. It allows you to model problems and then make assertions about those models. Modeling can be useful, as it helps you think about the problem, and assertions are grand. If you can make it work. Eventually. But to me this felt difficult to apply and not very well scalable to more complicated problems.

What CSP looks like

But CSP

So instead of writing intricate mathematical algorithms, we can just write programs and debug and test them. Better still, the CSP

A simple algorithm: syncing broccolis with CSPM

The goal is to keep grocery list items in sync between different clients, connected to a central database via one or more servers. Here is what the algorithm looks like, using broccoli:

Broccoli algorithm

The rest of this section walks through representing the broccoli algorithm using CSP

Client sync process

For the client, I define two different processes which essentially behave as one process: CLIENT and CLIENT’. The CLIENT process either uploads a diff to the server when the user performs an action or behaves just like the CLIENT’ process. The CLIENT’ process takes a diff from the server, renders it, and then behaves like the CLIENT process. The t variable represents the client’s state and is incremented each time data is saved to the database. The client process has three events it can engage in: up, down, and render. Each event has a client number associated with it to identify the client and a state representing the database “time”.

How do we know if we implemented the process correctly? You can interactively explore the states of the process using :probe CLIENT(0, 0), which gives you a dialogue like this:

The probe dialog shows the possible actions our process can engage in at any given state. It’s easy to run through the diagram, try out different scenarios, and assure oneself that the process works how it was supposed to.

Server sync process

The server sync process does two things. First, it takes in input from the client via the up event, saves it to the database, and sends any remaining changes back down to the client. Second, it reads in changes from other clients via (our representation of) the Datomic report queue and sends them down to the client, if they haven’t already been handled.

Database process

The database process DB is very simple. It takes a new state to save from the server sync process i and saves it, returning the id t representing the database after being saved. The report queue process REPORTQUEUE takes these saved events and puts them in a sliding buffer of size 1 for other server sync processes to consume. This is how changes are passed from one client to another. It is a sliding buffer of size 1 because we only care about the latest state of the database. If we don’t have time to process an event in between it doesn’t matter – we just diff with the latest state we have available and send that down to the client.

Process wiring

Finally, everything is wired together in the end.

All processes start with an initial state t=0. The CLIENT process communicates with the SERVER process with the up and down events. The server process in turn communicates with the REPORTQUEUE process via the report_queue event. Each client/server connection pair, represented by the CONN process, receives saved events synchronously from the database. Finally, all of the connections communicate with the database via the save and saved events.

Putting it all together

As I left out a few details, here is the full example:

So now we have a model representing our implementation. This is not the simplest model we could have come up with, nor is it the most complex. There are a lot of details that we left out. Client-server connections are not actually synchronous, even though we represent them as so. There is a certain amount of buffering in the system that is not modeled. We consider the diffing and patching algorithm only in very vague terms, passing around a single variable t to represent the database state. We don’t represent the possibility of lost packets or of the client disconnecting. We don’t represent users making changes as a separate event, instead tying it in directly to the moment the changes are sent to the server via the up event.

This is a simplified representation of the problem but already a useful one. Our algorithm is concise and unambiguous. We can visualize the algorithm and iteratively exploring the states of the algorithm using FDR3’s :probe SYSTEM.

But beyond that, this allows us to prove things about our algorithm.

Proving synchronization works: Sync one broccoli

What would we like to say about our algorithm using this model? Well, we want to prove that syncing “works”. What does it mean for syncing to “work”? Let’s start simple. Syncing works if I enter a broccoli on my phone and, at some point later, my wife sees broccoli on her phone. No other users or vegetables are involved.

Syncing a single broccoli

How can we verify this using our CSP

1. Allow only one change from one client

First, we want to restrict our implementation to allow only a single change from a single user. We do this by defining a processes that perform according to this specification and then forcing our implementation to synchronize with these processes on the relevant events.

Our first process restricts the implementation to only a single change:

The MaxInputs process performs n changes from any client i and then stops. If we set n=1then it only allows one change. The question mark in up?i?t indicates that we are taking in i and t as inputs, which in this case we ignore. The SKIP event represents successful termination of the process.

A second process specifies that the input must come from a specific client:

The OnlyClient process takes in inputs infinitely, but only from client i. The important syntactical difference here is in up!i?t. The exclamation point indicates that we are forcing i to be the specified client i i.e. only the one client is allowed to make changes.

To force our algorithm to behave according to these constraints, we tell it to synchronize with our constraint processes on the up event:

The system can only engage in up events when both of our other processes are ready to engage in up events. Thus the system is only allowed to take in a single input from client 0 and then cannot take any more inputs. Our constraints only affect the inputs, so it can still freely do whatever else that it wants, like upload to the server or save changes to the database. To verify that our constrained process works properly you can explore it with :probe OneInputFromClientZero.

2. Assert that the change makes it to the other client

Now, we want to show that a change on client 0 will make it to client 1. What does this look like from the outside i.e. what sequence of events would we expect to see if this works? We would expect to see an input/upload event from client 0 followed by a render event from client 1: <up.0.0, render.1.1>. In between a bunch of other events happen that we don’t care about right now.

Here is a process that does that:

The SyncOneInput process is our specification. This is how we hope our system will behave. We hope it will take input from client 0, render the state on client 1, and stop.

Finally, we prove that our system fulfills this specification by asserting that it does:

What we assert here is that our constrained implementation OneInputFromClientZero must behave according to our specification SyncOneInput. If our specification takes in one input and sends it up to the server, our implementation must take in one input and send it up to the server. If our specification then, at some point after this, renders a new state on the other client, our implementation must also at some point render a new state on the other client.

We can run the assertion through FDR3 and see that it works. Syncing one broccoli works. This is like a simple end to end test for our algorithm. But let’s make it more interesting.

Proving synchronization works: Sync many vegetables

This app isn’t very useful if it only sends one broccoli from one person to another. What if I throw in tomatoes and my wife throws in milk and our friends who are visiting want zucchini and butternut squash?

Syncing many vegetables from many people

Instead of only allowing one client to make a single change, let’s allow any client to make any number of changes, up to maximum number of changes. They stop adding things at some point. Then we verify that each client properly renders the final changes. This is essentially a definition for eventual consistency.

First, we need to allow a larger number of inputs n instead of just one input. We can use our MaxInputs process from above:

Next, we need to define what we actually want to see in our specification. If the system is restricted to n inputs, what we would like to see after all of the inputs have been entered is for each of the connected clients to render the final state. For example, if 9 inputs are allowed, we allow 9 inputs from arbitrary clients and then expect to see all of the clients to render the state after 9 changes i.e. engage in the render.i.9 event. Conceptually this is still a very simple specification, though the syntax gets more involved:

Without going into the syntax, this allows input nondeterministically in any order from any client and then requires them all to render the final state in nondeterministic order.

The final assertion can be run with FDR3 and it passes. We can even change the number of clients from two to four and see that it still passes.

What have we proved here? A similar end-to-end test might work as follows. We pull up four browsers. We randomly make n changes on a random window, stop for some period, and then verify that all of the browsers end up in the same state. This is a useful test. It shows that the system, including the parts outside of the algorithm, works in this specific situation. It means you probably didn’t screw up anything majorly.

But ultimately that test is also very limited. It randomly tests one single possible timing of events out of a huge space of possible timings of events. The exact behavior of the system, and whether it works or not, may depend on very specific timings. The test only tests one of these timings. The next time it runs it tests a different timing of events. Even if it fails, we have no good way of repeating the test. We just know that our system is “broken”. Somehow.

Our CSP assertion is much more powerful than an end-to-end test. Instead of testing one specific situation, it tests every single combination of timings that can occur with our system and says that in all cases our system will end up in sync. If the assertion passes, we know that the algorithm should work for any timing of events. If the assertion fails, it provides a single clear example of a sequence of events that causes the assertion to fail.

Conclusion: Does my sync algorithm work?

Our “eventual consistency” assertion says that given n connected clients, if we allow them to arbitrarily make k changes, eventually all clients will render those changes and stop changing. In this case I tested it with 4 clients and 9 changes, but this can be arbitrarily adjusted to any number of clients or changes as long as computing power suffices. On my computer two clients take a moment to verify, three take 36 seconds, and four take over twenty minutes. The parameters can be adjusted until you’re sure that the algorithm actually does work.

Note some of the things that this assertion does not say. It does not make any claims about timing. In theory synchronization could take a huge (finite) number of steps. Nor does not make any claims about the diffing/patching algorithm itself. We just assume it works.

The second, important part of verifying that the sync algorithm works is to make sure that the model actually represents our app in the important ways. It doesn’t help to write one model and implement another. It’s also easy to overlook real world aspects of the system like buffering or disconnections. Some of these could be handled simply enough outside of the model, as long as our model assumptions hold. Others might need to be included into the model.

Because the Clojure core.async library is based on CSP, it is a straightforward process to translate our model into Clojure. My final sync algorithm as it currently stands looks as follows.

Client sync process (source)

Server sync process (source)

I think it works. I have a few niggling doubts about the interaction between buffering and the diffing algorithm. But if it doesn’t work, I have a solid way to make it better. It might not work because the implementation doesn’t match the model, in which case I update one or the other. Or it might be that the model itself is flawed. If the model is flawed, I should be able to write an assertion that fails and then change the model until it does not fail. All of these changes build on each other. The model can be iteratively improved without the need to start over from the beginning.

Try it out here. Pull up two browsers on the same URL and see if they sync. Use it for your family grocery run. If it doesn’t work, let me know. Or better yet, show how it doesn’t work based on the CSP algorithm, or by showing how the implementation doesn’t match the algorithm.

Why CSP matters

The point here isn’t that my CSP representation is a perfect model of my app and that my app must work in all situations because the model works. The point is that this CSP approach makes writing complicated concurrent or distributed systems easier. It is a relatively lightweight model checking approach that falls more in the category of advanced type systems than formal verification. We don’t need to manually manipulate math to see that our algorithm works. We just write a program in a funny (but modern) language, write tests in the form of assertions, and have our model checker brute force it to see if our tests pass or fail.

Porting the resulting CSP model to Clojure is relatively straightforward because Clojure’s core.async library is based on CSP. It would be even more straightforward for a CSP implementation in a language closer to CSP

This is not a silver bullet. It’s easy for models to grow large enough to make them time consuming to check, for instance. The practicality of this approach for my day-to-day problems is an open question. But the promise is tantalizing. Once I had sufficient understanding of how CSP and CSP

This is why CSP matters.

1. What we are actually specifying is much more precise than my hand-waving explanation. We are saying that our specification process failures/divergences refines our implementation process. For further detail I direct you to the Roscoe book, which provides precise mathematical definitions.