Upcoming Events

Busy Season is officially underway.  Here’s where you can find me over the next several weeks.

  1. Tonight (2017-03-23), I am giving a webinar for the Southern New England UseR Group.  My topic is data cleansing with R.
  2. This Saturday (2017-03-25), I will be in Phoenix, Arizona for SQL Saturday Phoenix.  I will give two presentations there, one on the APPLY operator and one on using Kafka in .NET shops.
  3. On Saturday, April 1st, I will be in Orange County, California for SQL Saturday OC.  I will give two presentations there, one on the APPLY operator and one on using Kafka in .NET shops.
  4. On Tuesday, April 4th, I will be in Charlotte, North Carolina for the Charlotte BI Group meeting.  My topic there is using Kafka for real-time data ingestion.
  5. On Saturday, April 8th, I will be in Madison, Wisconsin for SQL Saturday Madison.  My topic there is R for SQL Server developers.
  6. On Saturday, April 22nd, I will be in Raleigh, North Carolina for Global Azure Bootcamp.  My topic will be Azure SQL Data Warehouse.
  7. On Saturday, April 29th, I will be in Rochester, New York for SQL Saturday Rochester.  I will give two presentations there, including one on the wide world of the data platform and one on R for SQL Server developers.  As a special bonus, if you’re there, check out Chris Hyde’s talk on SQL Server R Services, which will be right after mine.  We did this pair of talks back to back at another event and they fit together quite well.
  8. On Thursday, May 4th, I will be in Roanoke, Virginia for the Roanoke Valley .NET User Group monthly meeting.  My topic there is using Kafka for real-time data ingestion.
  9. On Saturday, May 6th, I will be in Baltimore, Maryland for SQL Saturday Baltimore.  My topic there is using the APPLY operator.
  10. On Saturday, May 20th, I will be in New York, New York for SQL Saturday New York City.  I’m not sure yet what my topic will be.

That’s my speaking agenda for the next couple of months.  A couple more events might pop up in there, but I’ll call that officially busy.  If you’re able to make it out to any of these events, I hope to see you there.

Game blogging update and Baseball HOFs

I’m going to move future game blogging to my Patreon page. I will occasionally post about non-gaming topics (i.e., sports) here, but the gaming content fits better over there. I’ll be starting my YouTube channel soon, so keep an eye out for that!

As far as the Hall of Fame voting goes, in all honesty there weren’t that many surprises, unless you count Pudge sailing in despite the controversies of his career. I was pleasantly surprised to see Manny surpass 25% and Bonds and Clemens break 50%, which bodes well for their candidacies. Vladimir Guerrero and Trevor Hoffman now look like locks, although the 2018 class has some strong first-time candidates (Chipper Jones and Jim Thome strike me as definite first-ballot guys, and I expect Scott Rolen will get in eventually). My favorite case for next year is Jamie Moyer; he won 260+ games, so we’ll have to see what kind of voting he gets. Vizquel will get traction but probably won’t make it on his first try.

Jorge Posada did better than I expected, getting almost enough votes to stay on the ballot, but I’m surprised that neither Drew nor Orlando Cabrera got a vote. The Boston thing must not be as strong as I thought.

The Benefits Of Technological Differentiation

Dave Mason has some early qualms about “big data” technologies like Hadoop and Spark. Paraphrasing his argument (and splicing in arguments I’ve heard from other SQL Server specialists over the past few years), it seems that a good practitioner can replicate the examples and demos out there for technologies like Spark and Hadoop, sometimes performing even better than the original demo. This is especially true whenever someone comes in with a laptop and runs a demo off of a virtual machine (like, uh, me). But I think there is an important caveat to this critique.

Trade-Offs

For me, Hadoop and Spark aren’t about “things you simply cannot do otherwise.” There are some companies where the scope and scale are such that there might not be another practical choice, but I’ve never worked at one of those companies. For the rest of us, I think it’s more a question of value versus cost on two levels. First, I can buy ten mid-spec servers for a lot less than a single server with 8-10x the power. Second, SQL Server Enterprise Edition is pricey. So if it’s equally easy to build a solution with Spark as with SQL Server, then the cost difference is, ceteris paribus, a reason to lean toward Spark.

There’s a lot of overlap in data platform technologies, and good people can create workable solutions on quite a few platforms when data sizes and response time requirements are reasonable. It’s when you start relaxing some of these assumptions that platform technologies really differentiate themselves. For example, suppose I need sub-millisecond query response times for frequently-updated data while retaining ACID properties; in that case, I’d lean toward a powerful relational database like SQL Server. If I need sub-second response times for large warehouse queries, I’d look toward Teradata or maybe Azure SQL Data Warehouse. If I need to ingest millions of data points per second in near-real-time, I probably want to combine Kafka with Spark. If I need to process petabytes of non-relational genomic data, Hadoop is king of that show. At the other end of the scale, if I need to put a small relational database on millions of embedded devices, I’ll use SQLite or maybe SQL Server Compact Edition. In each of these cases, it’s not that it’s literally impossible to use Tech Stack A instead of Tech Stack B, or that people who start with Tech Stack B will solve business problems in entirely different ways than professionals familiar with Tech Stack A; it’s that the relative trade-offs can make one a more viable alternative than the other.
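To make the Kafka-plus-Spark ingestion scenario a bit more concrete, here is a minimal PySpark Structured Streaming sketch. The broker address and topic name are hypothetical placeholders (nothing here comes from a specific project), running it requires the spark-sql-kafka connector package, and a real pipeline would parse the payload and write to a durable sink rather than the console.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = (SparkSession.builder
         .appName("kafka-ingest-sketch")
         .getOrCreate())

# Read a stream of events from a Kafka topic; the broker address and the
# "sensor-events" topic are placeholders for illustration only.
events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092")
          .option("subscribe", "sensor-events")
          .load())

# Kafka delivers the key and value as binary; cast the value to a string
# so it can be parsed or aggregated downstream.
parsed = events.select(col("value").cast("string").alias("payload"))

# Write the stream out; a console sink stands in for a durable one here.
query = (parsed.writeStream
         .format("console")
         .outputMode("append")
         .start())

query.awaitTermination()
```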

As a totally different example, I can use Full-Text Indexing and various tricks (like n-grams) to perform quick text search in SQL Server. Up to a certain data size, that will even work, and if it meets my data and response time requirements, great. But if I’m expected to do full-text search over the Library of Congress in milliseconds, I’m probably at a point where I need something tailored to that specific problem, like Solr.
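As a rough, hypothetical illustration of the n-gram idea (not taken from any particular implementation), the sketch below builds an inverted index of character trigrams in plain Python and uses it to answer substring searches; in SQL Server you would persist the trigrams in an indexed table and join against it instead.

```python
from collections import defaultdict

def trigrams(text: str) -> set:
    """Return the set of 3-character substrings of text (lowercased)."""
    text = text.lower()
    return {text[i:i + 3] for i in range(len(text) - 2)}

# Hypothetical documents standing in for rows in a table.
documents = {
    1: "The quick brown fox",
    2: "Congressional library records",
    3: "Quick thinking saves the day",
}

# Build an inverted index from trigram -> document ids.
index = defaultdict(set)
for doc_id, text in documents.items():
    for gram in trigrams(text):
        index[gram].add(doc_id)

def search(term: str) -> set:
    """Candidate documents containing every trigram of the search term."""
    grams = trigrams(term)
    if not grams:
        return set()
    candidates = set.intersection(*(index[g] for g in grams))
    # Verify candidates, since shared trigrams alone do not guarantee a match.
    return {d for d in candidates if term.lower() in documents[d].lower()}

print(search("quick"))  # {1, 3}
```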

Nothing New Under The Sun(?)

Setting aside the question of restrictive constraints, I want to address the architecture point I made above in a little more detail. Based on my reading of Dave’s post, it sounds like he’s expecting New Ways Of Doing Things across the board. That expectation is easy to form, because many proponents tend to blur the lines between techniques, architectures, and solving business problems at the highest level.

Technique changes over time.  The Google MapReduce paper has spawned a generation of distributed computing techniques and has indirectly led to Resilient Distributed Datasets, kappa and lambda architectures, and plenty more.  But if you look at it at a high enough level, the concepts have stayed very similar.  We’re still building text processing engines, aggregation engines, and lots of servers which take data from one place and put it into a different place (sometimes making it look a bit different in the process).  At that level, there’s nothing new under the sun.
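As a small illustration of that point (both snippets below are hypothetical sketches, not anything from Dave’s post or mine), the same word-count aggregation can be written against a plain Python dictionary or against a Spark RDD in MapReduce style; the plumbing changes, the concept does not.

```python
from collections import Counter
from pyspark import SparkContext

lines = ["the quick brown fox", "the lazy dog"]

# Single-machine version: a plain in-memory aggregation.
local_counts = Counter(word for line in lines for word in line.split())

# Distributed version: the same aggregation expressed as map and reduce
# steps over a Spark RDD. Requires a running Spark installation.
sc = SparkContext.getOrCreate()
distributed_counts = (sc.parallelize(lines)
                      .flatMap(lambda line: line.split())
                      .map(lambda word: (word, 1))
                      .reduceByKey(lambda a, b: a + b)
                      .collect())

print(local_counts)
print(dict(distributed_counts))
```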

But I don’t consider that a shortcoming; it is an acknowledgement that conceptual solutions to business problems are independent of their technological implementations.  Part of our job as data specialists is to translate conceptual solutions to a particular set of technological tools available under given constraints for a particular problem.  The new tools won’t open up a previously-undiscovered world of conceptual solutions; instead, they shift the constraints and potentially open doors when we previously would have said “Sorry, not possible.”

That said, I think that there is a benefit in people knowing multiple tech stacks, because that helps us delay or defer the “Sorry, not possible” mentality.  That’s because “Sorry, not possible” really means “Sorry, not possible given my best expectation of what servers, processes, technological stacks, potential solutions, the budget, and business requirements look like at this moment.”  That’s a lot of hidden assumptions.

Hadoop and Spark Specialties

Wrapping up what was supposed to be a three-line response to Dave on the SQL Server Community Slack, the closest thing I have off-hand to a “thing you simply cannot do otherwise” with Spark is distributed analytics with SparkR/sparklyr or Pandas.  You can use a scale-up approach with an R server or a Microsoft R Server instance, but analyze a large enough data set and you’ll eventually run out of RAM.  With the enterprise-level version of Microsoft R Server, you can page to disk so the query won’t outright fail (like it will when you’re using the open-source R client or Microsoft R Open), but performance will not be great.
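For a sense of what that scale-out looks like in practice, here is a minimal PySpark sketch with a hypothetical file path and column names; a SparkR or sparklyr version would be structurally the same. The point is that the aggregation is expressed once, and Spark spreads the work and the memory footprint across the cluster rather than requiring the whole data set to fit in one machine’s RAM.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("distributed-analytics-sketch").getOrCreate()

# Read a data set that may be far larger than any single node's memory.
# The path and column names below are placeholders for illustration.
readings = spark.read.parquet("hdfs:///data/sensor_readings")

# The group-and-summarize work runs on the executors; only the small
# aggregated result comes back to the driver.
summary = (readings
           .groupBy("device_id")
           .agg(F.avg("temperature").alias("avg_temp"),
                F.count(F.lit(1)).alias("observations")))

summary.show(20)
```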

But even then, the mindset isn’t so much “How do we come up with a brand new concept to solve the problem?” as much as it is “How can we relax binding constraints on existing problems?”  That, I think, is the critical question to answer, and where you start to see value in these platforms.