Dave Mason has some early qualms about “big data” technologies like Hadoop and Spark. Paraphrasing his argument (and splicing it with arguments I’ve heard from other SQL Server specialists I’ve talked to over the past few years), it seems that a good practitioner can replicate the examples and demos that are out there for technologies like Spark and Hadoop, sometimes even better than Spark or Hadoop can in the original demo. This is especially true whenever someone comes in with a laptop and runs a demo off of a virtual machine (like, uh, me). But I think that there is an important caveat to this critique.
For me, Hadoop & Spark aren’t about “things you simply cannot do otherwise.” There are some companies where the scope and scale work out that there might not be another practical choice, but I’ve never worked at one of those companies. I think that it’s more a question of value versus cost on two levels. First, I can buy ten mid-spec servers a lot cheaper than a single server with 8-10x the power. Second, SQL Server Enterprise Edition is pricey. So if it’s equally easy to build a solution with Spark vs SQL Server, then that—ceteris paribus—is a potential reason to use Spark.
There’s a lot of overlap in data platform technologies and good people can make create workable solutions with reasonable data size and reasonable response time requirements using quite a few platforms. It’s when you start relaxing some of these assumptions that platform technologies really differentiate themselves. For example, suppose I need sub-millisecond query response times for frequently-updated data while retaining ACID properties; if so, I’d lean toward a powerful relational database like SQL Server. If I need sub-second response times for large warehouse queries, I’d look toward Teradata or maybe Azure SQL Data Warehouse. If I need to ingest millions of data points per second in near-real-time, I probably want to combine Kafka with Spark. If I need to process petabytes of non-relational genomic data, Hadoop is king of that show. On the other side, if I need to put a small relational database on millions of embedded devices, I’ll use sqlite or maybe SQL Server Compact Edition. In each of these cases, it’s not so much that it’s literally impossible to envison using Tech Stack A instead of Tech Stack B or that people who start using Tech Stack B will come up with entirely different ways of solving business problems than professionals familiar with Tech Stack A, but rather that the relative trade-offs can make one a more viable alternative than the other.
As a totally different example, I can use Full-Text Indexing and various tricks (like n-grams) to perform quick text search in SQL Server. For some data size, that’ll even work, and if it meets my data and response time requirements, great. But if I’m expected to do full-text search of the Library of Congress in milliseconds, I’m probably at a point where I need something tailored to this specific problem, like Solr.
Nothing New Under The Sun(?)
Aside from restrictive constraints, I want to address in a little more detail the architecture point I made above. Based on my reading of Dave’s post, it sounds like he’s expecting New Ways Of Doing Things across the board. That’s because many of the proponents tend to blur the lines between techniques, architectures, and solving business problems at the highest level.
Technique changes over time. The Google MapReduce paper has spawned a generation of distributed computing techniques and has indirectly led to Resilient Distributed Datasets, kappa and lambda architectures, and plenty more. But if you look at it at a high enough level, the concepts have stayed very similar. We’re still building text processing engines, aggregation engines, and lots of servers which take data from one place and put it into a different place (sometimes making it look a bit different in the process). At that level, there’s nothing new under the sun.
But I don’t consider that a shortcoming; it is an acknowledgement that conceptual solutions to business problems are independent of their technological implementations. Part of our job as data specialists is to translate conceptual solutions to a particular set of technological tools available under given constraints for a particular problem. The new tools won’t open up a previously-undiscovered world of conceptual solutions; instead, they shift the constraints and potentially open doors when we previously would have said “Sorry, not possible.”
That said, I think that there is a benefit in people knowing multiple tech stacks, because that helps us delay or defer the “Sorry, not possible” mentality. That’s because “Sorry, not possible” really means “Sorry, not possible given my best expectation of what servers, processes, technological stacks, potential solutions, the budget, and business requirements look like at this moment.” That’s a lot of hidden assumptions.
Hadoop and Spark Specialties
Wrapping up what was supposed to be a three-line response to Dave on the SQL Server Community Slack, the closest thing I have off-hand to a “thing you simply cannot do otherwise” with Spark is distributed analytics with SparkR/sparklyr or Pandas. You can use a scale-up approach with an R server or a Microsoft R Server instance, but analyze a large enough data set and you’ll eventually run out of RAM. With the enterprise-level version of Microsoft R Server, you can page to disk so the query won’t outright fail (like it will when you’re using the open-source R client or Microsoft R Open), but performance will not be great.
But even then, the mindset isn’t so much “How do we come up with a brand new concept to solve the problem?” as much as it is “How can we relax binding constraints on existing problems?” That, I think, is the critical question to answer, and where you start to see value in these platforms.