Modulo Operations In R

I had a discussion with a co-worker the other day about whether the modulo operation always has the same rank in order of operations as multiplication and division.  My comment was that in C-based languages and SQL, this is generally true, but it isn’t always the case with all programming languages.  Case in point, here’s how R handles various scenarios:

  • 9*3%%5 = 27.  Modulo happens before multiplication, so 3%%5 resolves to 3 first, leaving 9*3 = 27.
  • (9*3)%%5 = 2.  Operations in parentheses happen before operations outside them, just as you would expect in math, so this becomes 27%%5 = 2.
  • 9*(3%%5) = 27.  Again, operations in parentheses happen first.
  • 24/6%%5 = 24.  As with multiplication, modulo happens before division, so 6%%5 resolves to 1 first, leaving 24/1 = 24.
  • (24/6)%%5 = 4.  The operation becomes 4%%5, which is 4.
  • 24/(6%%5) = 24.  This becomes 24/1, which is 24.

By contrast, with SQL, 9*3%5 = 2 and 24/6%5 = 4.  This means that modulo operations are handled at the same level as multiplication and division, whereas with R, modulo operations are handled before multiplication and division.

So how about exponents?  Surely exponents resolve before modulo, right?

  • 3**5%%3 = 0
  • (3**5)%%3 = 0
  • 3**(5%%3) = 9

So yes, exponents resolve before modulo.
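For contrast, here are the same expressions in Python, where % sits at the same precedence level as * and /, just like C and SQL:

```python
# In Python, as in C and SQL, % shares a precedence level with * and /,
# and operators at the same level evaluate left to right.
print(9 * 3 % 5)     # (9*3) % 5  -> 2
print(24 / 6 % 5)    # (24/6) % 5 -> 4.0

# Exponentiation binds tighter than %, matching R's behavior.
print(3 ** 5 % 3)    # (3**5) % 3 -> 0
```

Same inputs, different answers for the first two lines, which is exactly why it's worth checking the precedence table before porting arithmetic between languages.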

Will Live Query Be Useful?

SQL Server 2016 is going to include a new feature called Live Query Statistics. The first time I saw it, I thought it was awesome. But having thought about it for a few days, I’m not so sure I will ever use it. Here’s why:

First of all, I almost never read execution plans in Management Studio; I use SQL Sentry Plan Explorer instead. It processes the plan and presents it with more useful information in a much easier-to-read format.

Secondly, this feels like a toy. It looks pretty, but I don’t foresee any real value in watching partial data come in, especially because the flows are not guaranteed to be correct. These numbers are still based on estimates (as far as I can tell), so we will see some weird scenarios with one expected row churning and churning.

As long as there’s no performance hit, I don’t mind having this available, but this doesn’t touch my top ten favorite new features in 2016.

Why We Can’t Have Good Things: In-Memory OLTP

I’ve just recently implemented the first memory-optimized tables at our company.  My original plan was to create a few memory-optimized staging tables on our transactional systems, one memory-optimized staging table on the warehouse side, and several memory-optimized TVPs to help with shuttling data around.

This plan quickly fell apart.  Here’s a quick summary of the problems I ran into during the sprint.  I knew about many of them, but there were a couple of fascinating new issues I had not yet seen.

  1. Memory-optimized tables are limited to 8,060 bytes per row.  Some of my dimensions have LOB data (a memory-optimized no-no) and other things which are not allowed.  That meant that a couple of my TVPs were going to stay on-disk.
  2. I created my memory-optimized queue tables for staging, and we use triggers to load these queue tables.  This isn’t a great thing overall (I’m eventually going to replace the triggers with Service Broker messages), but it’s what I have to deal with.  Anyhow, I ran into a very interesting error message in stored procedures which eventually call triggers which insert into the memory-optimized tables:  “A user transaction that accesses memory optimized tables or natively compiled procedures cannot access more than one user database or databases model and msdb, and it cannot write to master.”  I got this message because our transactional procedures often join to a replicated database to get static data that we need.  The way around this problem is to use separate transactions, but I don’t really want to do that.  This ruled out memory-optimized tables on the OLTP systems.
  3. On one of our instances with a very small, fixed-size transaction log, we found a very interesting problem:  the transaction log was not truncating due to XTP_CHECKPOINT, and because it was full, we could not make any alterations.  Oh, and “any alterations” includes expanding the log size…  There is an open Connect item which basically says that the message may or may not be a lie.  One person mentioned that taking a full backup fixed the problem, but in our case, we had to restart the database engine service and re-size the log.
  4. As a side note, you cannot take database snapshots if you have a memory-optimized filegroup.  You also cannot drop the memory-optimized filegroup unless you drop the entire database.  This means that, in practice, you cannot get rid of a memory-optimized filegroup once you create it, and creating one limits your available DBA options.

All of this has made In-Memory OLTP in SQL Server 2014 less than ideal for a real production environment.  I’m hoping that 2016 has significant enough improvements that we can expand our use of In-Memory OLTP, but for now, our initial scenario has collapsed down to a few memory-optimized TVPs and one staging table on our warehouse instances.

Reading And Writing CSVs

Our fearless Penguatroll, in his spare time, runs an Out of the Park fantasy baseball league, in which I occasionally remember to participate.  As part of this league, we’ve put together a draft site using OOTP Online Utilities.  Said Penguatroll realized that not all players were showing up as expected, and so it was my task to figure out what was going on.

After fixing a few separate issues, I discovered that the CSV has a flag called draft_eligible, which was set to 0 (false) for certain players even though it should have been set to 1 (true).

Throw Excel At The Problem 

My first stab at the problem involved Excel, which failed miserably.  The reason it failed is that OOTP uses somewhat strange rules for text qualifiers:  it uses quotation marks even when there are no commas, but only for a few fields like first/last/nickname, date of birth, etc.  Excel “helpfully” decides that these quotation marks are text qualifiers and strips them out, never to be seen again.  The problem, however, is that I need these quotation marks, so that leaves scripting something.

Research Time

My first stab at this was F#.  I took advantage of the F# CSV type provider and whipped up the following:

open System
open FSharp.Data

// Use tilde as the quote character so the real quotation marks survive parsing
type Players = CsvProvider<"C:\\Temp\\players.csv", Quote = '~', HasHeaders = true>

[<EntryPoint>]
let main argv =
    let players = Players.Load("C:\\Temp\\players.csv")
    let playerSample = players.Rows
                        |> Seq.take 100

    for player in playerSample do
        printfn "Name:  %s %s" player.First_name player.Last_name
    Console.ReadLine() |> ignore
    0 // return an integer exit code

What this let me do was see the problem, but from there, I was limited in options—mostly due to my lack of F# knowledge. I needed to modify a specific attribute based on a few others (if the player’s age is at least 16, if the league is not one of the professional leagues, etc.), and the type provider’s output is read-only. If this were a 10-attribute type, I would have gone forward, but we’re talking about dozens and dozens of attributes.

With the type provider, I needed to use tilde (~) as the qualifier (Quote) to preserve my quotation marks.  I did not have any tildes in my file, so this was a safe qualifier.

PowerShell To The Rescue

Hitting a wall in F# (and giving myself a new problem to work through once I have time to learn more about the language), I decided to drop back to PowerShell and the Import-Csv cmdlet:

$csv = Import-Csv "C:\Temp\players.csv"
$csvtest = $csv[0]

Once I imported the data set, I grabbed the first line and checked out its attributes. And lo, the quotation marks were gone. This frustrated me greatly, as when I started doing Google searches, they all ended up being cases where people wanted to remove quotation marks rather than preserving them.
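This behavior is not unique to Import-Csv: most CSV parsers treat quotation marks as text qualifiers and strip them on read. Python’s csv module, for example, does the same thing unless you tell it otherwise (the sample line here is made up):

```python
import csv
import io

line = '1,"Babe","Ruth"\n'

# Default behavior: quotation marks are text qualifiers and get stripped.
print(next(csv.reader(io.StringIO(line))))
# -> ['1', 'Babe', 'Ruth']

# QUOTE_NONE tells the parser to treat quotes as ordinary characters.
print(next(csv.reader(io.StringIO(line), quoting=csv.QUOTE_NONE)))
# -> ['1', '"Babe"', '"Ruth"']
```

Import-Csv has no equivalent of that second mode, which is what pushed me toward the workaround below.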

What I finally had to do was a bit cheap: I replaced all of the quotation marks with tildes (noting that there were none in my data set) and then imported the CSV, ran my script, and exported the results to CSV:

# Flag active, non-pro-league players who are at least 16 as draft eligible
$csv | %{ if($_.retired -eq 0 -and $_.league_id -NotIn 0,105,108,109,110,135 -and $_.age -gt 15) { $_.draft_eligible = 1 } }
# -NoTypeInformation keeps Export-Csv from prepending a #TYPE header line
$csv | Export-Csv "C:\Temp\players_new.csv" -NoTypeInformation

From there, I swapped those tildes back out to quotation marks and all was well.  You can automate the character swaps, but because this was a one-time thing, I decided just to modify the file externally.
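The whole tilde round trip can be sketched in Python as well. The column names and sample row below are hypothetical (the real file has dozens of columns), but the mechanics are the same: swap quotes for tildes, parse, flip the flag, write out, swap back.

```python
import csv
import io

# Hypothetical sample; the real players.csv has dozens of columns.
raw = ('player_id,"first_name","last_name",age,league_id,retired,draft_eligible\n'
       '1,"Babe","Ruth",21,200,0,0\n')

# Step 1: swap the quotation marks for tildes (none exist in the data),
# so the parser passes them through untouched.
swapped = raw.replace('"', '~')

rows = list(csv.DictReader(io.StringIO(swapped)))
for row in rows:
    # Same eligibility rule as the PowerShell script above
    if (row['retired'] == '0'
            and row['league_id'] not in {'0', '105', '108', '109', '110', '135'}
            and int(row['age']) > 15):
        row['draft_eligible'] = '1'

# Step 2: write the file back out, then restore the quotation marks.
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=list(rows[0].keys()))
writer.writeheader()
writer.writerows(rows)
fixed = out.getvalue().replace('~', '"')
```

Note that the tilde swap happens before parsing and after writing, so the header’s quoted column names survive the trip too.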

Pluralsight Reviews: Design Patterns On-Ramp

I just recently finished Jeremy Clark’s Design Patterns On-Ramp.  I was very impressed with the presentation, and I think he did a fantastic job explaining patterns.  His target audience is developers with some .NET experience who are not necessarily architects or veterans with twenty years of development experience.

Clark’s first goal is to describe patterns in general and show how they apply to software development.  From there, he spends a lot of time talking about patterns that we already use in day-to-day C# development.  I think this was a great idea, as Clark significantly reduces the mental cost of understanding patterns by showing that we already have some intuitive grasp of the topic.

After that, Clark introduces a few important patterns which newer developers might not have used, showing that it isn’t difficult to pick up new patterns once you know what to look for.  Clark even introduces a non-GoF pattern:  the Repository pattern.

For an introduction to patterns, I don’t think you can ask for much more than what Jeremy has on offer.  I highly recommend this course to anybody looking to begin learning about this topic.

Time To Learn R

The time has come for data professionals to learn R.  Microsoft has purchased Revolution Analytics and is betting that they can make R faster, less memory-intensive, and better for enterprise applications.

SQL Server 2016 will also allow us to run R code against our SQL databases.  The upside is fantastic:  instead of pulling SQL data out via ODBC and manipulating it elsewhere, we can push the (much smaller) R code into SQL Server and process the data where it lives.

So why do you want to get in on this right now?  Because more complementary skills improve your chances of a successful career (where “successful” can mean more money, better benefits, the dress code or hours you want to work, etc.).  Let’s say that you’re a good database administrator who knows SQL Server pretty well.  That’s a nice skill to have and can land you a decent position.  But now let’s say that you’re also really good with statistics and can use tools like R to perform data analysis.  At this point, you’ve moved beyond “good database administrator” and into “really good data professional.”  Guess which one is more likely to land you that fantastic job offer.

R won’t be everyone’s cup of tea; a lot of people have trouble with or simply don’t like statistical analysis, and in that case, go pick up a different complementary skill.  But if you’re already doing a lot of data analysis on the cheap (queries, Excel spreadsheets, and reports for management), you’re part of the way to learning a valuable skill, and with everything integrating together in 2016, this is a golden opportunity to get ahead of the game.