I spent day 2 of PASS Summit 2015 in Bill Preachuk’s pre-con on Hadoop. The pre-con might have been a little too introductory for me, as I have experience with Hadoop (and my own presentation on the topic), but even then, it’s good to review the basics, especially for a product which changes so frequently.
There were a few important things I picked up from this session. The first one is to use Tez. Running Hive queries on Tez instead of classic MapReduce can lead to a significant performance increase. It’s nowhere near as fast as Spark, but it also doesn’t need to be totally in-memory.
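As a rough illustration of what “use Tez” looks like in practice, Hive picks its execution engine from the `hive.execution.engine` setting, which can be set per session. The sketch below just builds the HiveQL text; the helper name `with_tez` is my own invention for illustration, and it assumes Tez is actually installed on the cluster.

```python
def with_tez(query: str) -> str:
    """Prefix a HiveQL query with the session setting that selects Tez.

    `with_tez` is a hypothetical helper; only the SET statement itself
    is real Hive syntax.
    """
    # Normalize the trailing semicolon so we always emit exactly one.
    body = query.rstrip().rstrip(";")
    return "SET hive.execution.engine=tez;\n" + body + ";"


if __name__ == "__main__":
    # The table name is made up for the example.
    print(with_tez("SELECT COUNT(*) FROM sales"))
```

You would paste the resulting text into a Hive session or script; the setting only affects that session unless it is also changed in the cluster configuration.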
Another thing I picked up is that data partitioning in Hive is vital. For tables of any significant size, you want at least one partitioning structure in place. What’s interesting about Hive, in contrast to a relational or Kimball-style warehousing system, is that it’s OK to make multiple copies of a Hive table, each partitioned a different way. For example, you might partition sales data several different ways:
- Year/Month/Day to support date range queries and basic analytics
- Country/State to support breakdown by region
- Store# to support analysis by store
This means creating three copies of the same data, but the idea is that disk is cheap (especially if you’re following the Hadoop JBOD model).
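To make the three-copies idea concrete, here is a minimal sketch that generates the HiveQL for each layout. Everything about the schema is an assumption on my part: the table names, the column names and types, the ORC storage format, and the `sales_raw` source table are all hypothetical. The one Hive-specific detail worth noticing is that partition columns go in `PARTITIONED BY` rather than in the regular column list, so each copy ends up with a different set of “data” columns.

```python
from typing import Dict, List, Tuple

# Hypothetical full column set for an illustrative sales table.
ALL_COLUMNS: List[Tuple[str, str]] = [
    ("sale_id", "BIGINT"),
    ("amount", "DECIMAL(10,2)"),
    ("sale_year", "INT"),
    ("sale_month", "INT"),
    ("sale_day", "INT"),
    ("country", "STRING"),
    ("state", "STRING"),
    ("store_id", "INT"),
]

# One table per partitioning scheme from the list above (names are made up).
LAYOUTS: Dict[str, List[str]] = {
    "sales_by_date": ["sale_year", "sale_month", "sale_day"],
    "sales_by_region": ["country", "state"],
    "sales_by_store": ["store_id"],
}


def create_table_ddl(table: str, partition_keys: List[str]) -> str:
    """Build a CREATE TABLE statement with the given partition columns."""
    types = dict(ALL_COLUMNS)
    # Partition columns must not be repeated in the regular column list.
    data_cols = [(n, t) for n, t in ALL_COLUMNS if n not in partition_keys]
    cols_sql = ",\n  ".join(f"{n} {t}" for n, t in data_cols)
    parts_sql = ", ".join(f"{k} {types[k]}" for k in partition_keys)
    return (
        f"CREATE TABLE {table} (\n  {cols_sql}\n)\n"
        f"PARTITIONED BY ({parts_sql})\n"
        "STORED AS ORC;"
    )


def load_sql(table: str, partition_keys: List[str], source: str = "sales_raw") -> str:
    """Build a dynamic-partition insert; partition columns come last in the SELECT."""
    data_cols = [n for n, _ in ALL_COLUMNS if n not in partition_keys]
    select_list = ", ".join(data_cols + partition_keys)
    return (
        f"INSERT OVERWRITE TABLE {table} "
        f"PARTITION ({', '.join(partition_keys)})\n"
        f"SELECT {select_list} FROM {source};"
    )


if __name__ == "__main__":
    # Dynamic partitioning generally needs these session settings enabled.
    print("SET hive.exec.dynamic.partition=true;")
    print("SET hive.exec.dynamic.partition.mode=nonstrict;\n")
    for name, keys in LAYOUTS.items():
        print(create_table_ddl(name, keys), "\n")
        print(load_sql(name, keys), "\n")
```

A date-range query would then hit `sales_by_date`, a regional breakdown would hit `sales_by_region`, and so on, with each query reading only the partitions it needs. The cost is keeping three tables loaded from the same source, which is the disk-is-cheap trade-off.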
Aside from that, I’ve put some more work into Curated SQL. I have the theme and a few other pieces set up. I’m going to try to work on the workflow and other niceties next, after which point I’ll start blasting away links.