NVARCHAR Everywhere: A Thought Experiment

Doubling Down on Madness

In the last episode of Shop Talk, I laid out an opinion which was…not well received. So I wanted to take some time and walk through my thinking a little more cogently than I was able to do during Shop Talk.

Here’s the short version. When you create a table and need a string column, you have a couple of options available: VARCHAR and NVARCHAR. Let’s say that you’re a developer creating a table to store string data. Do you choose VARCHAR or NVARCHAR? The classic answer is, “It depends.” And so I talk about why that is in video format right below these words.

A Video from the Void

The Camps

Camp One: only use VARCHAR. Prior to SQL Server 2019, this is basically the set of people who never have to deal with internationalization. (SQL Server 2019 changes the calculus a bit, because its UTF-8 collations let VARCHAR columns store Unicode data.) If you’re running solo projects or building systems where you know the complete set of users, and if there’s no need for Unicode, I can understand this camp. For projects of any significance, though, you usually have to go elsewhere.
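
As a quick aside on why 2019 matters: a UTF-8 collation lets a plain VARCHAR column hold Unicode data. Here is a minimal sketch, assuming a SQL Server 2019 or later instance; the table name is made up for illustration:

CREATE TABLE dbo.Utf8Test
(
    SomeStringColumn VARCHAR(100) COLLATE Latin1_General_100_CI_AS_SC_UTF8 NOT NULL
);
GO
INSERT INTO dbo.Utf8Test(SomeStringColumn) VALUES(N'🐭');
-- The mouse survives in VARCHAR because UTF-8 encodes it in 4 bytes.
SELECT SomeStringColumn, DATALENGTH(SomeStringColumn) FROM dbo.Utf8Test;
GO
DROP TABLE dbo.Utf8Test;
GO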

Camp Two: default to VARCHAR, only use NVARCHAR when necessary. There are a lot of people in this camp, especially in the western world. Most of the companies I’ve worked at live in this camp.

Camp Three: default to NVARCHAR, but use VARCHAR when you know you don’t need Unicode. This is a fairly popular group as well, and outside of this thought experiment, I probably end up here.

Aaron Bertrand lays out the costs and benefits of Camps Two and Three (or, Camps VARCHAR Mostly and NVARCHAR Mostly), so I recommend reading his arguments and understanding that I am sympathetic to them.

But there is also Camp Four: NVARCHAR everywhere. And this camp is growing on me.

Why NVARCHAR Everywhere?

I see several benefits to this:

  • Developers and product owners don’t need to think about or guess whether a particular string value will ever contain Unicode data. Sometimes we guess wrong, and migrating from VARCHAR to NVARCHAR can be a pain.
  • NVARCHAR Everywhere avoids implicit conversion between string columns because you can assume that everything is NVARCHAR. Implicit conversion can be a nasty performance impediment; see the sketch just after this list.
  • Furthermore, you can train developers to preface string literals with N, ensure that data access tools ask for Unicode strings (most ORMs either default to Unicode or know enough to do it right), and ensure that every stored procedure string parameter is NVARCHAR because there are no exceptions. That’s one less thing you ever have to think about when designing or tuning a database and one less area where ambiguity in design can creep in.
  • If somebody tries to store Unicode data in a VARCHAR column, that information is silently lost: any character outside the column’s code page gets replaced with a question mark.
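
To make the implicit conversion point concrete, here is a minimal sketch; dbo.Customers and its CustomerCode column are hypothetical stand-ins. Because NVARCHAR has higher data type precedence than VARCHAR, SQL Server converts the column side of the comparison, and depending on your collation that can turn an index seek into a full scan:

DROP TABLE IF EXISTS dbo.Customers;
GO
CREATE TABLE dbo.Customers
(
    CustomerCode VARCHAR(20) NOT NULL,
    CONSTRAINT [PK_Customers] PRIMARY KEY CLUSTERED(CustomerCode)
);
GO
-- The N'...' literal is NVARCHAR, so CustomerCode gets converted implicitly;
-- with a SQL collation, that means scanning every row instead of seeking.
SELECT c.CustomerCode
FROM dbo.Customers c
WHERE c.CustomerCode = N'ABC123';
GO

With NVARCHAR everywhere, both sides of that comparison are already NVARCHAR and the question never comes up.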

Why Not NVARCHAR Everywhere?

The first thing you’ll hear from people about this is storage requirements: NVARCHAR characters typically take 2 bytes, whereas equivalent VARCHAR characters typically take 1 byte. Solomon Rutzky goes into great detail on the nuances of this topic, but let’s stick with the simplistic version for now because I don’t think the added nuance changes the story any.
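
The simplistic version is easy to demonstrate:

-- 1 byte for the VARCHAR literal, 2 bytes for the NVARCHAR equivalent.
SELECT DATALENGTH('A') AS VarcharBytes, DATALENGTH(N'A') AS NvarcharBytes;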

SQL Server has Unicode compression, meaning that, per row, if the data in a column can fit in your collation’s code page, the database engine can compress the Unicode data to take as much space as equivalent VARCHAR data would—maybe it’s a little bigger but we’re talking a tiny amount. Enabling row-level compression turns on Unicode compression as well and can provide additional compression benefits. And page-level compression does an even better job at saving space on disk. There are CPU costs, but my experience has been that compression will often be faster because I/O subsystems are so much slower than CPU, even with fancy all-flash arrays or direct-attached NVMe.
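
If you want a feel for what compression would buy you before rebuilding anything, sp_estimate_data_compression_savings samples the table and reports estimated sizes. A quick sketch, run against the dbo.NTestTable demo table defined later in this post:

-- Estimate how much space row compression would save (sample-based, so approximate).
EXEC sp_estimate_data_compression_savings
    @schema_name = N'dbo',
    @object_name = N'NTestTable',
    @index_id = NULL,
    @partition_number = NULL,
    @data_compression = N'ROW';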

The exception is if you are using NVARCHAR(MAX) as your data type. In that case, Unicode compression and row-level compression won’t do anything, and page-level compression only works for data that stays in-row rather than spilling out into LOB pages. Hugo Kornelis covers why that is. So that’s a weakness, which means I need a bulleted list here.

  • NVARCHAR(MAX) columns with overflow to LOB will be larger than their VARCHAR counterparts and we cannot use Unicode, Row, or Page compression to reduce storage.
  • If your maximum data length is between 4001 and 8000 characters, you know the column will never have Unicode characters, and the data is highly compressible, you will save a lot of space with VARCHAR plus page-level compression, whereas in this zone you’d need an NVARCHAR(MAX) column and lose out (see the sketch after this list).
  • If you are in the unlucky situation where even row-level compression tanks your performance (something I’ve never seen, but I acknowledge it as a possibility), going with NVARCHAR becomes a trade-off between reducing storage and maximizing performance.
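
To see why that middle bullet forces your hand, here is a sketch with a hypothetical table: NVARCHAR’s declared length caps at 4000 characters, so anything longer has to become NVARCHAR(MAX), while VARCHAR declares up to 8000 just fine:

-- NVARCHAR(6000) is not a valid declaration; anything over 4000 must be MAX.
CREATE TABLE dbo.ZoneTest
(
    Id INT IDENTITY(1,1) NOT NULL,
    BigAsciiColumn VARCHAR(8000) NOT NULL,    -- in the 4001-8000 zone and still compressible
    BigUnicodeColumn NVARCHAR(MAX) NOT NULL,  -- forced to MAX, losing Unicode and row compression
    CONSTRAINT [PK_ZoneTest] PRIMARY KEY CLUSTERED(Id)
);
GO
DROP TABLE dbo.ZoneTest;
GO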

The Demo Code

In case you want to try out the demo code on your own, here it is:

USE [Scratch]
GO
DROP TABLE IF EXISTS dbo.TestTable;
DROP TABLE IF EXISTS dbo.NTestTable;
GO
CREATE TABLE dbo.TestTable
(
    Id INT IDENTITY(1,1) NOT NULL,
    SomeStringColumn VARCHAR(150) NOT NULL,
    SomeOtherStringColumn VARCHAR(30) NOT NULL,
    CONSTRAINT [PK_TestTable] PRIMARY KEY CLUSTERED(Id)
);
GO
CREATE INDEX [IX_TestTable_SomeStringColumn] ON dbo.TestTable
(
    SomeStringColumn
);
GO
CREATE TABLE dbo.NTestTable
(
    Id INT IDENTITY(1,1) NOT NULL,
    SomeStringColumn NVARCHAR(150) NOT NULL,
    SomeOtherStringColumn NVARCHAR(30) NOT NULL,
    CONSTRAINT [PK_NTestTable] PRIMARY KEY CLUSTERED(Id)
);
CREATE INDEX [IX_NTestTable_SomeStringColumn] ON dbo.NTestTable
(
    SomeStringColumn
);
GO

-- Test 1:  It's All ASCII.
INSERT INTO dbo.TestTable
(
    SomeStringColumn,
    SomeOtherStringColumn
)
SELECT TOP(100000)
    REPLICATE('A', 150),
    REPLICATE('X', 30)
FROM sys.all_columns ac
    CROSS JOIN sys.all_columns ac2;
GO
INSERT INTO dbo.NTestTable
(
    SomeStringColumn,
    SomeOtherStringColumn
)
SELECT TOP(100000)
    REPLICATE(N'A', 150),
    REPLICATE(N'X', 30)
FROM sys.all_columns ac
    CROSS JOIN sys.all_columns ac2;
GO
EXEC sp_spaceused 'dbo.TestTable';
EXEC sp_spaceused 'dbo.NTestTable';

-- Test 2:  Unicode me.
-- Each of these emoji is a surrogate pair: two UTF-16 code units, so 4 bytes.
SELECT DATALENGTH(N'🐭');
SELECT DATALENGTH(N'🪕');
GO
TRUNCATE TABLE dbo.NTestTable;
INSERT INTO dbo.NTestTable
(
    SomeStringColumn,
    SomeOtherStringColumn
)
SELECT TOP(100000)
    REPLICATE(N'🐭', 75),
    REPLICATE(N'🪕', 15)
FROM sys.all_columns ac
    CROSS JOIN sys.all_columns ac2;
GO
EXEC sp_spaceused 'dbo.TestTable';
EXEC sp_spaceused 'dbo.NTestTable';

-- Test 3:  Mix It Up.
TRUNCATE TABLE dbo.NTestTable;
INSERT INTO dbo.NTestTable
(
    SomeStringColumn,
    SomeOtherStringColumn
)
SELECT TOP(100000)
    REPLICATE(N'A', 148) + N'🐭',
    REPLICATE(N'X', 28) + N'🪕'
FROM sys.all_columns ac
    CROSS JOIN sys.all_columns ac2;
GO
EXEC sp_spaceused 'dbo.TestTable';
EXEC sp_spaceused 'dbo.NTestTable';
GO

-- Check the DATALENGTH
SELECT TOP(1)
    SomeStringColumn,
    DATALENGTH(SomeStringColumn)
FROM dbo.TestTable;

SELECT TOP(1)
    SomeStringColumn,
    DATALENGTH(SomeStringColumn)
FROM dbo.NTestTable;

-- Row Compression includes Unicode compression.
ALTER INDEX ALL ON dbo.NTestTable REBUILD WITH (DATA_COMPRESSION = ROW);
GO
-- Test 3a:  Continue to Mix It Up.
TRUNCATE TABLE dbo.NTestTable;
INSERT INTO dbo.NTestTable
(
    SomeStringColumn,
    SomeOtherStringColumn
)
SELECT TOP(100000)
    REPLICATE(N'A', 148) + N'🐭',
    REPLICATE(N'X', 28) + N'🪕'
FROM sys.all_columns ac
    CROSS JOIN sys.all_columns ac2;
GO
EXEC sp_spaceused 'dbo.TestTable';
EXEC sp_spaceused 'dbo.NTestTable';
GO

-- Another check of the DATALENGTH
SELECT TOP(1)
    SomeStringColumn,
    DATALENGTH(SomeStringColumn)
FROM dbo.TestTable;

SELECT TOP(1)
    SomeStringColumn,
    DATALENGTH(SomeStringColumn)
FROM dbo.NTestTable;

-- Let's check the LOB
DROP TABLE IF EXISTS dbo.NTestTableLob;
GO
CREATE TABLE dbo.NTestTableLob
(
    Id INT IDENTITY(1,1) NOT NULL,
    SomeStringColumn NVARCHAR(MAX) NOT NULL,
    SomeOtherStringColumn NVARCHAR(MAX) NOT NULL,
    CONSTRAINT [PK_NTestTableLob] PRIMARY KEY CLUSTERED(Id) WITH(DATA_COMPRESSION = ROW)
);
-- Can't use NVARCHAR(MAX) as a key column in an index...
/* CREATE INDEX [IX_NTestTableLob_SomeStringColumn] ON dbo.NTestTableLob
(
    SomeStringColumn
)  WITH(DATA_COMPRESSION = ROW); */
GO

-- No overflow necessary.
INSERT INTO dbo.NTestTableLob
(
    SomeStringColumn,
    SomeOtherStringColumn
)
SELECT TOP(100000)
    REPLICATE(N'A', 148) + N'🐭',
    REPLICATE(N'X', 28) + N'🪕'
FROM sys.all_columns ac
    CROSS JOIN sys.all_columns ac2;
GO
EXEC sp_spaceused 'dbo.NTestTable';
EXEC sp_spaceused 'dbo.NTestTableLob';
GO

-- What about page-level compression?
ALTER INDEX ALL ON dbo.NTestTableLob REBUILD WITH (DATA_COMPRESSION = PAGE);

EXEC sp_spaceused 'dbo.NTestTable';
EXEC sp_spaceused 'dbo.NTestTableLob';
GO

-- And to be fair, we'll see the same on NTestTable.
ALTER INDEX ALL ON dbo.NTestTable REBUILD WITH (DATA_COMPRESSION = PAGE);

EXEC sp_spaceused 'dbo.NTestTable';
EXEC sp_spaceused 'dbo.NTestTableLob';
GO

-- My page runneth over.
TRUNCATE TABLE dbo.NTestTableLob;
-- Let's reset the data compression.
ALTER INDEX ALL ON dbo.NTestTableLob REBUILD WITH (DATA_COMPRESSION = NONE);
INSERT INTO dbo.NTestTableLob
(
    SomeStringColumn,
    SomeOtherStringColumn
)
SELECT TOP(10000)
    -- Cast to NVARCHAR(MAX) first; otherwise REPLICATE truncates its output
    -- at 4000 characters and the rows would never spill into LOB.
    REPLICATE(CAST(N'🐭' AS NVARCHAR(MAX)), 14800),
    REPLICATE(CAST(N'X' AS NVARCHAR(MAX)), 28000) + N'🪕'
FROM sys.all_columns ac
    CROSS JOIN sys.all_columns ac2;
GO
EXEC sp_spaceused 'dbo.NTestTableLob';
GO
-- Now we compress.
ALTER INDEX ALL ON dbo.NTestTableLob REBUILD WITH (DATA_COMPRESSION = PAGE);
GO
EXEC sp_spaceused 'dbo.NTestTableLob';
GO

-- Time to clean up.
DROP TABLE IF EXISTS dbo.TestTable;
DROP TABLE IF EXISTS dbo.NTestTable;
DROP TABLE IF EXISTS dbo.NTestTableLob;
GO

-- Without the N prefix, the second literal is VARCHAR and the mouse is silently lost.
SELECT
    N'🐭' as Mouse,
    '🐭' as [Mouse?];

Final Thoughts…For Now

Where I stand right now: for greenfield database development, I heavily bias toward NVARCHAR and could even say NVARCHAR Everywhere. I think the benefits outweigh the costs here.

For brownfield database development, it’s a harder call to make because you almost certainly have a mix of VARCHAR and NVARCHAR data types. If you already have a solid system within a brownfield database, stick with that system. For example, you might use NVARCHAR for user-entry fields but VARCHAR for internal system fields like codes and levels. If that pattern works for you, that’s fine.
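
A hypothetical sketch of that mixed pattern, just for illustration:

CREATE TABLE dbo.UserProfile
(
    Id INT IDENTITY(1,1) NOT NULL,
    DisplayName NVARCHAR(100) NOT NULL, -- user-entered, so it may contain Unicode
    StatusCode VARCHAR(10) NOT NULL,    -- internal system code, known to be ASCII-only
    CONSTRAINT [PK_UserProfile] PRIMARY KEY CLUSTERED(Id)
);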

If you’re in a brownfield development mess, I can see the potential benefit of migrating to NVARCHAR Everywhere, but the work-to-benefit ratio is probably not going to be acceptable for most companies. The exception here is if you find out that you’re losing valuable customer data and need to go through an internationalization project. It might be tempting to change the minimum amount necessary, though my philosophy is that if you have the opportunity to make big improvements, take them.

But as I mention in the video, I’m interested in your thoughts as well, so leave a comment on the video or drop your thoughts in here. Is there something big I’m missing which makes NVARCHAR Everywhere untenable? Have I shifted your thinking at least a little bit? Let me know.