Appearing originally on his blog, Adrian Colyer embarks on a journey to get out of the fire swamp when it comes to data persistence and why these apps seem to be breaking down more and more.
(*) with apologies to Moseley, Marks, and Westley.
Something a little different to the regular paper reviews for the next three days. Inspired by yesterday’s ‘Consistency without Borders,’ and somewhat dismayed by what we learned in ‘Feral Concurrency Control’, I’m going to attempt to pull together a bigger picture, to the extent that I can see it, that emerges out of many of the papers we’ve been studying. There is more to say here than I can shoehorn into the review of any one single paper!
The Data Crisis
We have a “data crisis”. Most of the applications that we write that work with persistent data are broken, and the situation seems to be getting worse. How did we get into this state and what can we do about it?
Let’s consider some of the evidence.
Even applications that work with a traditional RDBMS are prone to integrity violations
- The gold standard of serializable ACID transactions in a trusted RDBMS is either unavailable or not used in practice when it is. (See table 2 in Highly Available Transactions: Virtues and Limitations for a good summary of default and maximum isolation levels in a variety of common databases). Oracle’s strongest isolation level is something called Snapshot Isolation for example, which still permits anomalies. The default isolation level for databases is very rarely serializable. It should go without saying that none of the weaker-than-serializable isolation models guarantee serializability, but the performance benefits are considered to outweigh the potential costs of anomalies.
- A relational database that either doesn’t offer serializability, or is configured to use a lesser isolation level, will be vulnerable to anomalies – and we can even build models to predict how many. It was a great piece of marketing to call them ‘anomalies.’ When you say ‘will be vulnerable to violation of application integrity constraints’ it doesn’t sound nearly so cute and harmless.
- As we saw in Feral Concurrency Control, programmers don’t even take advantage of the guarantees that relational databases do give. Feral application level concurrency control mechanisms through validation and association annotations and scant use of transactions is becoming increasingly common. Fundamental integrity constraints such as uniqueness are broken under such a model (even if you did have a database offering serializability on the backend).
- I’ve yet to meet a programmer that performed a careful analysis of all the anomalies that could occur as a result of the selected isolation level and chosen concurrency control mechanisms, and validated those anomalies against the business transactions and integrity constraints to ensure that either they were acceptable or a recovery strategy was in place if they were to occur. This is very difficult to do in general, so the more common approach is just to pretend that anomalies don’t exist.
It gets far worse when we venture into the land of eventual consistency
This of course is all just the relational database world, our ‘safe place’ where we like to think everything is consistent (it isn’t). Then along came the cloud and the CAP theorem. We need availability, low-latency, partition tolerance, and scalability. Because we couldn’t avoid partitions we rushed headlong into a world of eventual consistency.
- Under eventual consistency we will see many more anomalies (application integrity violations): we can model how many.
- Programmers that do try to wrestle with this find it difficult, many I suspect still follow the ‘pretend they don’t happen’ strategy, last-writer-wins, it’ll be fine…
- Since partition tolerance was one of the key drivers for this change, it seems a big shame that many implementations don’t handle partitions very well at all – adding data loss to our data corruption issues. (Remember, partitions will happen).
- Performance (low-latency) was another key driver – best to look good out of the box so why not default to W=1 asynchronous writes? You probably won’t lose any (much) data right? (This seems to be the NoSQL parallel to RDBMSs choosing a lesser isolation level by default).
Your database is fried from all the guarantees that lied. I know you tried, but do you have any ethics or pride? / You think you built a database but all you made was a dark place where writes never use disk space.
Kelly Sommers, tweet storm excerpts, April 2015.
- That’s the I and the D from ACID hosed. Bad news on the A front too: many eventually consistent systems don’t support multi-item/partition transactions, so that’s another set of integrity constraints we won’t be able to guarantee.
Two constraints we need to learn to live with
Can’t we have safety and performance? (and availability, and partition-tolerance?). What’s the way out of this mess? How do I build a modern application on a stack I can put into production without exposing myself to data loss and corruption?
There are two very interesting and relevant results from some of the papers that we’ve looked at in The ‘Morning Paper’.
- We’re always going to have anomalies. In any always available convergent system we can’t get any stronger consistency that causal+ consistency. (See Consistency, Availability, and Convergence).
- We can’t solve the data crisis solely by building a better datastore – we’re going to need some cooperation at a higher level. (Consistency Without Borders, Feral Concurrency Control).
Let’s unpack those statements a little bit:
Anomalies are Inevitable
We didn’t get better at dealing with failure by pretending it doesn’t happen. Quite the opposite in fact – we realised that the more components in a system, the more likely failure is, to the point where we need to explicitly design for it in the understanding that failure will happen. And to force our hands we introduced Chaos Monkeys and a Simian Army.
We’re not going to get better at dealing with application integrity violations by pretending they don’t happen. Quite the opposite in fact – the more concurrent operations we have (and the weaker the isolation level), the more likely application integrity violations are, to the point were we need to explicitly design for them in the understanding that application integrity violations will happen. And to force our hands, perhaps the Simian Army needs a new monkey – Consistency Monkey! Liveness failures are easy to spot (a process dies and becomes unresponsive, bad things happen). Safety failures lurk in the shadows, we need to work to make them equally visible.
What does it mean to design for application integrity violations? First off, let’s assume that we’ve done everything practical to avoid as many of them as possible, and secondly let’s assume that we know which anomalies that still leaves us vulnerable to (I’ll come back to these topics later). Pat Helland then gives us a model for thinking about this situation based on Memories, Guesses, and Apologies. You do the best you can (guesses) given your knowledge of the current state of the world (memories), and if it turns out you got something wrong, you apologize.
Note that you can’t apologize if you don’t know you’ve done something wrong! So we need a process that can detect integrity violations when they occur, and trigger the apology mechanism. If we’re smart about it we can automate the detection of violations and invocation of a user-defined apology callback – but what it means to apologise will always be specific to the application and needs to be coded by the developer.
Instead of trading off performance and data corruption, we need to be trading off performance and the rate of apologies we need to issue.
We can’t just build a better datastore
Reasoning about consistency at the I/O level as sequences of reads and writes is too low level and too far divorced from the application programmer’s concerns. Application programmers find the mapping too obscure to be of practical use in figuring out, for example, what anomalies might occur and what they might mean in application terms. And as we’ve seen, so long as we stick with that model, programmers will most likely to continue to eschew what the database provides and implement feral concurrency controls. At the same time, loss of semantic information about what the application really is trying to achieve leads to inefficiencies due to over-conservative coordination. Building a better datastore without also building a better way for applications to interact with the datastore isn’t going to solve the data crisis.
It all starts with the application
If you reflect on some of the papers we’ve been looking at, an interesting pattern starts to emerge that is very application-centric:
- We get maximum performance out of distributed systems by minimizing the amount of coordination required. And the way to do that is through an application-level understanding of the application’s invariants. This is the lesson of I-Confluence.
- We can’t get stronger consistency than causal+ consistency. Out of the box this can give us the ALP of ALPS (availability, low-latency, partition-tolerance) but it struggles with Scalability under the default potential causality model. To get scalability back, we need to move to anexplicit causality model. In other words, we need the application to explicitly tell us the causal dependencies of a write.
- To compensate for anomalies that do occur, we need an apology mechanism. Apologies can only be defined at the application level.
- Programmers have voted with their feet and told us that they want to control application integrity in tandem with the domain model.
- Reducing the need for coordination through the adoption of distributed data types (e.g. CRDTs and Cloud Types) requires the cooperation of the application programmer.
In Part II we’ll explore further what it could mean to address the data crisis from the application down.