Q&A: The Curated Data Platform

On Thursday, I presented a session at PASS Summit entitled The Curated Data Platform. You can grab slides and links to additional information on my website. Thank you to everyone who attended the session.

During and after the session, I had a few questions come in from the audience, and I wanted to cover them here.

Cross-Platform Data Strategies

The first question was, “What handles the translation between the relational truth system and the Document system?” The context of the question comes from a discussion about product catalogs, and specifically this slide.

Document databases are great for things like product catalogs, where we meet the following properties:

  • Data has a very high read-to-write ratio.
  • You generally look at one item per page—in this example, one product.
  • The set of data to appear on a page is complex and typically has nested items: a product has attributes (title, price, brand, description) but also collections of sub-items (product image links, current stock in different stores, top reviews, etc.).
  • The data is not mission-critical: if updates are delayed or even occasionally lost, that is acceptable.

But I do like keeping a “golden record” version of the data and my biases push me toward storing that golden record in a relational database. I mentioned two processes in my talk: a regular push on updates and an occasional full sync to true up the document database(s).

And that leads to the question of, how do we do that? There are products from companies like CData and Oracle which can handle this, or you can write your own. If your source is SQL Server, I’d push for a two-phase process:

  1. Enable Change Data Capture on the SQL Server instance and have a scheduled task query the latest changes and write them to your document database(s). You can use constructs like FOR JSON PATH in SQL Server to shape the documents directly, or pull in the source data and shape it in your application code.
  2. Periodically (e.g., once an hour, once a day), grab all of the data, shape the documents, and perform a comparison with what’s out there. This will confirm that nothing slips through the cracks for longer than one time period and will keep disparate clusters of document data separated.

Of course, this golden record doesn’t need to be in a relational database—you could store it in a document database and use replication there to push data to different clusters. If you use Cosmos DB, for example, you can replicate to other regions easily.

Document Databases: Scale-Out vs Replication

Another attendee asked about “Document databases and scale-out vs replication.” In retrospect, I think I misinterpreted the question as asked, as I mentioned that scale-out and replication are one and the same: you replicate data between nodes in a cluster to achieve scale-out.

But this time around, I’m going to answer the question, “How do I choose between making my current cluster bigger and replicating out to a new cluster?”

Here are some key considerations:

  • If the issue you are trying to solve is geographical in nature, replicate out to a new cluster closer to your users. In other words, suppose you have a cluster hosted in Virginia. Many of your users are in Japan, so they have to deal with the network latency of pulling data from a Virginia-based data center. If this is the problem, create another document database cluster in Japan and replicate to it from Virginia.
  • If your cluster is in Virginia and is getting hammered hard by relatively local users, scaling out is generally a good option. That is, adding more servers to the existing cluster. Depending on your technology, there will be a maximum number of nodes or a maximum scale-out size available to you, so you’d have to check out those details.
  • If you’re getting close to that maximum scale-out size, it may make sense to replicate to another cluster in the same region and use a load balancer to shift load between the two. I have to be vague here because different technologies have different limits and I’m definitely not an expert on any document database technology’s systems architecture.

Cosmos DB and MongoDB

Another attendee asked, “I have heard that Azure Cosmos DB is built upon an older version of MongoDB – do you know if this is true?”

The answer is no, it’s not. The two platforms are different. I believe where the question comes from is around the MongoDB API for Cosmos DB. For a while, Cosmos DB supported an older version of the MongoDB API, specifically 3.2. That API was released in December of 2015. New Cosmos DB clusters support the MongoDB 3.6 API, which is still active.

But something I want to point out is that the API is an interface, not an implementation. That Cosmos DB supports a specific MongoDB API version doesn’t mean that the code bases are similar; it only means that you can safely (well, presumably safely) interact with both and expect to get the same results when you perform the same set of API steps with the same inputs.

Graph Languages

My last question came from an attendee who mentioned, “I thought GraphQL was the common standard for graph querying.”

The context for this is in my discussion of graph databases, particularly the slide in which I talk about the key issues I have with graph databases. For a bit more background than what I had time to get into during the session, Mala Mahadevan and I have talked about graph databases in a little bit of detail on a couple of occasions, once on the SQL Data Partners Podcast and once on Shop Talk.

As for the question, the comment I had made was that there is no common graph language. We have SQL for relational databases (and other mature data platform technologies) but historically haven’t had a common language for graph platforms, meaning that you have to learn a new language each time you move to a different platform. The Gremlin language is an attempt at creating a single language for graph databases and it’s making enough strides that it may indeed become the standard. But it’s not there yet.

Meanwhile, GraphQL, despite the name, is not a language for graph databases. It’s actually a language for querying APIs. The key idea is that you ask for data from an API and you get back just the data you want. But behind the API, your data can be stored in any sort of data source—or even multiple data sources. In other words, I might expose a product catalog API which hits Cosmos DB, a finance API which hits SQL Server, and a product associations API which hits Neo4j. Those three APIs could all be queried using GraphQL, as it’s up to the API to interpret inputs and return the appropriate outputs.