Joins and scalability

A couple of days ago, there was an interesting discussion on Twitter about joins and scalability, following Franck Pachot’s article that itself is basically a reply to another article by a NoSQL expert and a DynamoDB evangelist Alex Debrie.

If you can afford to explore the two articles and the Twitter thread in detail, I recommend you do so, it’s really worth it, but if not, here is an executive summary:

  • Alex is arguing that performance of relational databases is a black box and joins inhibit scalability
  • Franck is showing some examples that prove that joins can scale. Also, he makes the point that EXPLAIN PLAN allows one to see what’s going on with a query’s performance so it’s not really a black box.
  • Both authors make some good points. Relational databases have hard time scaling horizontally, and joins can be one of the reasons for that (imagine a sharded database where you frequently need to do cross-shard joins). It’s not too bad for some data models, it’s worse for some others. And in the RDBMS industry there is an entire sub-industry of performance tuning — so while there are tons of tools that can allow you to see what’s taking all this time processing a query, mastering those tools requires lots of time and effort.

    However, the question is not just whether or not joins can make life complicated (oh yes they can), but what do you do about it. And this is where I take an issue with Alex’s article, because he tries to paint a lack of a feature as a feature.

    If non-relational databases are bad at joining data sets it’s not something they should be proud of. We are not just doing table joins because we are relational purists with a compulsive urge to overnormalize everything to atoms. Quite often, we have to join data because it can come from different sources implemented on different modules with different life cycles. You can, of course, force the application to somehow accumulate the pieces, and put them together before you load them into a database — but that’s not eliminating the need for joins, it’s just shipping them to the application level, and that’s not good design.

    For example, if you are managing a hospital, you have a lot of data coming from various subsystems: accounting, HR, pharmacy, patients’ records etc., each feeding its data into the system from its own module. You can organize them into separate schemas, sure, but you’d want to be able to query across all those schemas. And even if we didn’t have different schemas, even if we didn’t have different tables — in many cases we still need to be able to do self-joins.

    And while Alex agrees that lack of joins makes non-relational data solutions less flexible in terms of questions they can answer, he tries to sugarcoat it as “shifting-left data design”, i.e. trying to force database designers to think early (thus the left shift on an imaginary time diagram) about the kind of questions their systems need to be able to answer.

    This almost makes it sound like a good thing, but on the other hand, wasn’t “ask questions you haven’t been able to ask before” the big motto of the NoSQL revolution of the past decade? How can you “turn your data into actionable insights” if you can only ask questions from a short list that was carved in stone by the data architect when your system was first made? Also, it’s quite remarkable how this contrasts with the fuss about unparalleled flexibility of NoSQL because of their schemaless designs.

    Don’t get me wrong, I’m all for thinking early about all sorts of design choices. But e.g. if you are working in a regulated sector, no amount of early thinking can allow you to guess what kind of reports your regulators would come up with in years to come.

    And if we could extend the same kind of logic to the relational world, we could claim that it it’s not bad at horizontal scaling. It simply “shifts-left” capacity decisions! Just make up your mind as to the size of your database and the number of users during the design phase, and stick with it!

    Another thing that it difficult for me to digest in Alex’s blog post is the way he defends DynamoDB’s inability to aggregate data as protecting users from writing bad queries. While certainly this is a way to put a positive spin on something that is not so positive, it is also very confusing.

    I would rather we would all use engineering and not marketing language and call things what they are. NoSQL sucks at joins. RDBMS sucks at horizontal scaling. DynamoDB sucks at aggregation. Each will probably get somewhat better over time, but the gap is likely to remain. That’s fine, because we can combine different technologies into elegant solutions that minimize their respective shortcomings. As long as we remain clear on what each of them can and can’t do.

    Leave a Reply

    Fill in your details below or click an icon to log in: Logo

    You are commenting using your account. Log Out /  Change )

    Google photo

    You are commenting using your Google account. Log Out /  Change )

    Twitter picture

    You are commenting using your Twitter account. Log Out /  Change )

    Facebook photo

    You are commenting using your Facebook account. Log Out /  Change )

    Connecting to %s