Doubling Down on Madness
In the last episode of Shop Talk, I laid out an opinion which was…not well received. So I wanted to take some time and walk through my thinking a little more cogently than I was able to do during Shop Talk.
Here’s the short version. When you create a table and need a string column, you have a couple of options available: VARCHAR and NVARCHAR. Let’s say that you’re a developer creating a table to store this string data. Do you choose VARCHAR or NVARCHAR? The classic answer is, “It depends.” And so I talk about why that is in video format right below these words.
A Video from the Void
Camp One: only use VARCHAR. Prior to SQL Server 2019 (which muddies the waters by adding UTF-8 collations, letting VARCHAR store Unicode data), this is basically the set of people who never have to deal with internationalization. If you’re running solo projects or building systems where you know the complete set of users, and if there’s no need for Unicode, I can understand this camp. For projects of any significance, though, you usually have to go elsewhere.
Camp Two: default to VARCHAR, only use NVARCHAR when necessary. There are a lot of people in this camp, especially in the western world. Most of the companies I’ve worked at live in this camp.
Camp Three: default to NVARCHAR, but use VARCHAR when you know you don’t need Unicode. This is a fairly popular group as well, and outside of this thought experiment, I probably end up here.
Aaron Bertrand lays out the costs and benefits of Camps Two and Three (or, Camps VARCHAR Mostly and NVARCHAR Mostly), so I recommend reading his arguments and understanding that I am sympathetic to them.
But there is also Camp Four: NVARCHAR everywhere. And this camp is growing on me.
Why NVARCHAR Everywhere?
I see several benefits to this:
- Developers and product owners don’t need to think about or guess whether a particular string value will ever contain Unicode data. Sometimes we guess wrong, and migrating from VARCHAR to NVARCHAR can be a pain.
- NVARCHAR Everywhere avoids implicit conversion between string columns because you can assume that everything is NVARCHAR. Implicit conversion can be a nasty performance impediment.
- Furthermore, you can train developers to preface string literals with N, ensure that data access tools ask for Unicode strings (most ORMs either default to Unicode or know enough to do it right), and ensure that every stored procedure string parameter is NVARCHAR because there are no exceptions. That’s one less thing you ever have to think about when designing or tuning a database and one less area where ambiguity in design can creep in.
- If somebody tries to store Unicode data in a VARCHAR column, that information is silently lost. NVARCHAR Everywhere closes off that failure mode entirely.
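That last point is easy to demonstrate for yourself. A minimal sketch, assuming a database whose default collation is not UTF-8 (so the emoji has no mapping in the VARCHAR code page):

```sql
-- Storing Unicode in a VARCHAR silently degrades the data: characters
-- outside the collation's code page become question marks.
DECLARE @v VARCHAR(10) = N'🐭';
DECLARE @nv NVARCHAR(10) = N'🐭';
SELECT @v AS [VarcharValue], @nv AS [NvarcharValue];
-- On a typical non-UTF-8 collation, @v comes back as '??' (one '?' per
-- UTF-16 surrogate code unit), while @nv round-trips intact. No error,
-- no warning--the data is just gone.
```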
Why Not NVARCHAR Everywhere?
The first thing you’ll hear from people about this is storage requirements: NVARCHAR characters are typically 2 bytes, whereas equivalent VARCHAR characters are typically 1 byte. For the nuanced version of this, Solomon Rutzky goes into great detail on the topic, but let’s stick with the simplistic version for now because I don’t think the added nuance changes the story any.
SQL Server has Unicode compression, meaning that, per row, if the data in a column can fit in your collation’s code page, the database engine can compress the Unicode data to take as much space as equivalent VARCHAR data would—maybe it’s a little bigger but we’re talking a tiny amount. Enabling row-level compression turns on Unicode compression as well and can provide additional compression benefits. And page-level compression does an even better job at saving space on disk. There are CPU costs, but my experience has been that compression will often be faster because I/O subsystems are so much slower than CPU, even with fancy all-flash arrays or direct-attached NVMe.
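For reference, here is the sort of statement that turns this on (the table name is illustrative); the demo script later in the post uses the same pattern:

```sql
-- Row compression includes Unicode compression: rebuild the clustered
-- index and all nonclustered indexes with it enabled.
ALTER INDEX ALL ON dbo.SomeTable REBUILD WITH (DATA_COMPRESSION = ROW);

-- Page compression layers prefix and dictionary compression on top of
-- row compression, usually saving more space at additional CPU cost.
ALTER INDEX ALL ON dbo.SomeTable REBUILD WITH (DATA_COMPRESSION = PAGE);
```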
The exception is if you are using NVARCHAR(MAX) as your data type. In that case, Unicode and row-level compression won’t do anything, and page-level compression only works if your data fits on a page rather than falling into LOB storage. Hugo Kornelis covers why that is. So that’s a weakness, which means I need a bulleted list here.
- NVARCHAR(MAX) columns with overflow to LOB will be larger than their VARCHAR counterparts and we cannot use Unicode, Row, or Page compression to reduce storage.
- If your maximum data length is between 4001 and 8000 characters, you know the column will never have Unicode characters, and the data is highly compressible, you will save a lot of space with VARCHAR plus page-level compression; in that zone, you’d need to use an NVARCHAR(MAX) column and lose out.
- If you are in the unlucky situation where even row-level compression tanks your performance (something I’ve never seen, but I acknowledge it as a possibility), going with NVARCHAR becomes a trade-off between reducing storage and maximizing performance.
The Demo Code
In case you want to try out the demo code on your own, here it is:
USE [Scratch];
GO
DROP TABLE IF EXISTS dbo.TestTable;
DROP TABLE IF EXISTS dbo.NTestTable;
GO
CREATE TABLE dbo.TestTable
(
    Id INT IDENTITY(1,1) NOT NULL,
    SomeStringColumn VARCHAR(150) NOT NULL,
    SomeOtherStringColumn VARCHAR(30) NOT NULL,
    CONSTRAINT [PK_TestTable] PRIMARY KEY CLUSTERED(Id)
);
GO
CREATE INDEX [IX_TestTable_SomeStringColumn] ON dbo.TestTable
(
    SomeStringColumn
);
GO
CREATE TABLE dbo.NTestTable
(
    Id INT IDENTITY(1,1) NOT NULL,
    SomeStringColumn NVARCHAR(150) NOT NULL,
    SomeOtherStringColumn NVARCHAR(30) NOT NULL,
    CONSTRAINT [PK_NTestTable] PRIMARY KEY CLUSTERED(Id)
);
CREATE INDEX [IX_NTestTable_SomeStringColumn] ON dbo.NTestTable
(
    SomeStringColumn
);
GO

-- Test 1: It's All ASCII.
INSERT INTO dbo.TestTable
(
    SomeStringColumn,
    SomeOtherStringColumn
)
SELECT TOP(100000)
    REPLICATE('A', 150),
    REPLICATE('X', 30)
FROM sys.all_columns ac
    CROSS JOIN sys.all_columns ac2;
GO
INSERT INTO dbo.NTestTable
(
    SomeStringColumn,
    SomeOtherStringColumn
)
SELECT TOP(100000)
    REPLICATE(N'A', 150),
    REPLICATE(N'X', 30)
FROM sys.all_columns ac
    CROSS JOIN sys.all_columns ac2;
GO
EXEC sp_spaceused 'dbo.TestTable';
EXEC sp_spaceused 'dbo.NTestTable';

-- Test 2: Unicode me.
SELECT DATALENGTH(N'🐭');
SELECT DATALENGTH(N'🪕');
GO
TRUNCATE TABLE dbo.NTestTable;
INSERT INTO dbo.NTestTable
(
    SomeStringColumn,
    SomeOtherStringColumn
)
SELECT TOP(100000)
    REPLICATE(N'🐭', 75),
    REPLICATE(N'🪕', 15)
FROM sys.all_columns ac
    CROSS JOIN sys.all_columns ac2;
GO
EXEC sp_spaceused 'dbo.TestTable';
EXEC sp_spaceused 'dbo.NTestTable';

-- Test 3: Mix It Up.
TRUNCATE TABLE dbo.NTestTable;
INSERT INTO dbo.NTestTable
(
    SomeStringColumn,
    SomeOtherStringColumn
)
SELECT TOP(100000)
    REPLICATE(N'A', 148) + N'🐭',
    REPLICATE(N'X', 28) + N'🪕'
FROM sys.all_columns ac
    CROSS JOIN sys.all_columns ac2;
GO
EXEC sp_spaceused 'dbo.TestTable';
EXEC sp_spaceused 'dbo.NTestTable';
GO
-- Check the DATALENGTH.
SELECT TOP(1) SomeStringColumn, DATALENGTH(SomeStringColumn) FROM dbo.TestTable;
SELECT TOP(1) SomeStringColumn, DATALENGTH(SomeStringColumn) FROM dbo.NTestTable;

-- Row Compression includes Unicode compression.
ALTER INDEX ALL ON dbo.NTestTable REBUILD WITH (DATA_COMPRESSION = ROW);
GO
-- Test 3a: Continue to Mix It Up.
TRUNCATE TABLE dbo.NTestTable;
INSERT INTO dbo.NTestTable
(
    SomeStringColumn,
    SomeOtherStringColumn
)
SELECT TOP(100000)
    REPLICATE(N'A', 148) + N'🐭',
    REPLICATE(N'X', 28) + N'🪕'
FROM sys.all_columns ac
    CROSS JOIN sys.all_columns ac2;
GO
EXEC sp_spaceused 'dbo.TestTable';
EXEC sp_spaceused 'dbo.NTestTable';
GO
-- Another check of the DATALENGTH.
SELECT TOP(1) SomeStringColumn, DATALENGTH(SomeStringColumn) FROM dbo.TestTable;
SELECT TOP(1) SomeStringColumn, DATALENGTH(SomeStringColumn) FROM dbo.NTestTable;

-- Let's check the LOB.
DROP TABLE IF EXISTS dbo.NTestTableLob;
GO
CREATE TABLE dbo.NTestTableLob
(
    Id INT IDENTITY(1,1) NOT NULL,
    SomeStringColumn NVARCHAR(MAX) NOT NULL,
    SomeOtherStringColumn NVARCHAR(MAX) NOT NULL,
    CONSTRAINT [PK_NTestTableLob] PRIMARY KEY CLUSTERED(Id) WITH(DATA_COMPRESSION = ROW)
);
-- Can't use NVARCHAR(MAX) as a key column in an index...
/*
CREATE INDEX [IX_NTestTableLob_SomeStringColumn] ON dbo.NTestTableLob
(
    SomeStringColumn
) WITH(DATA_COMPRESSION = ROW);
*/
GO
-- No overflow necessary.
INSERT INTO dbo.NTestTableLob
(
    SomeStringColumn,
    SomeOtherStringColumn
)
SELECT TOP(100000)
    REPLICATE(N'A', 148) + N'🐭',
    REPLICATE(N'X', 28) + N'🪕'
FROM sys.all_columns ac
    CROSS JOIN sys.all_columns ac2;
GO
EXEC sp_spaceused 'dbo.NTestTable';
EXEC sp_spaceused 'dbo.NTestTableLob';
GO
-- What about page-level compression?
ALTER INDEX ALL ON dbo.NTestTableLob REBUILD WITH (DATA_COMPRESSION = PAGE);
EXEC sp_spaceused 'dbo.NTestTable';
EXEC sp_spaceused 'dbo.NTestTableLob';
GO
-- And to be fair, we'll see the same on NTestTable.
ALTER INDEX ALL ON dbo.NTestTable REBUILD WITH (DATA_COMPRESSION = PAGE);
EXEC sp_spaceused 'dbo.NTestTable';
EXEC sp_spaceused 'dbo.NTestTableLob';
GO
-- My page runneth over.
TRUNCATE TABLE dbo.NTestTableLob;
-- Let's reset the data compression.
ALTER INDEX ALL ON dbo.NTestTableLob REBUILD WITH (DATA_COMPRESSION = NONE);
-- Note: REPLICATE truncates its result at 8,000 bytes unless the input is a
-- MAX type, so cast the inputs first to guarantee genuinely large values.
INSERT INTO dbo.NTestTableLob
(
    SomeStringColumn,
    SomeOtherStringColumn
)
SELECT TOP(10000)
    REPLICATE(CONVERT(NVARCHAR(MAX), N'🐭'), 14800),
    REPLICATE(CONVERT(NVARCHAR(MAX), N'X'), 28000) + N'🪕'
FROM sys.all_columns ac
    CROSS JOIN sys.all_columns ac2;
GO
EXEC sp_spaceused 'dbo.NTestTableLob';
GO
-- Now we compress.
ALTER INDEX ALL ON dbo.NTestTableLob REBUILD WITH (DATA_COMPRESSION = PAGE);
GO
EXEC sp_spaceused 'dbo.NTestTableLob';
GO
-- Time to clean up.
DROP TABLE IF EXISTS dbo.TestTable;
DROP TABLE IF EXISTS dbo.NTestTable;
DROP TABLE IF EXISTS dbo.NTestTableLob;
GO
SELECT N'🐭' AS Mouse, '🐭' AS [Mouse?];
Final Thoughts…For Now
I think where I stand right now is that, for greenfield database development, I bias heavily toward NVARCHAR and could even say NVARCHAR Everywhere. I think the benefits outweigh the costs here.
For brownfield database development, it’s a harder call to make because you almost certainly have a mix of VARCHAR and NVARCHAR data types. If you already have a solid system within a brownfield database, stick with that system. For example, you might use NVARCHAR for user-entry fields but VARCHAR for internal system fields like codes and levels. If that pattern works for you, that’s fine.
If you’re in a brownfield development mess, I can see the potential benefit of migrating to NVARCHAR Everywhere, but the work-to-benefit ratio is probably not going to be acceptable for most companies. The exception here is if you find out that you’re losing valuable customer data and need to go through an internationalization project. It might be tempting to change the minimum amount necessary, though my philosophy is that if you have the opportunity to make big improvements, take them.
But as I mention in the video, I’m interested in your thoughts as well. So add them to the video or drop them in here. Is there something big I’m missing which makes NVARCHAR Everywhere untenable? Have I shifted your thinking at least a little bit? Let me know.
8 thoughts on “NVARCHAR Everywhere: A Thought Experiment”
Hi Kevin. Yes, you are totally off-base here. Though to be fair, that is mainly because I just wanted to say that you were totally off-base ;-).
More honestly, I still disagree with “NVARCHAR everywhere”, but maybe not as emphatically as I was being last Monday night during “Shop Talk”. I keep forgetting about data compression, which does help a lot, though doesn’t allow for “NVARCHAR _everywhere_”, as far as I am concerned. Here are some considerations that make it untenable:
1. While I do understand the benefit of making the design processes slightly less complicated, especially when factoring in data access from apps, etc, I still feel that data modeling is, by its very nature, a complex endeavour. I’m not super comfortable with providing a means for someone to not be fully conscious of the design decisions. For example, it is far easier to default integer types to INT and use BIGINT when that doesn’t work instead of tailoring certain columns to be either TINYINT or SMALLINT. This is what led to PartitionID originally being an INT even though the possible values were 0 – 9, and theoretically might have gone up to 99. This column was in most tables across many servers, accounting for multiple billions of rows. Fortunately, I was able to get it changed to TINYINT before it spread to too many tables.
2. Data Compression helps more data fit per row, but doesn’t impact max row size. So, having everything be NVARCHAR might constrain future design without providing any benefit.
3. Same as previous issue about Data Compression not helping max row size, but this time within the context of indexes. This is a separate line item because a) indexes have a much smaller max size than a row, and b) not all tables have indexes / not all indexes will use string columns as key columns.
4. As far as I can tell, Data Compression does not factor into memory grants. Hence, you might have compressed the data on disk, and even in the buffer pool, but memory grants might be unnecessarily bloated without providing any benefit in cases where this wasn’t needed.
5. The fact that PAGE compression can handle in-row NVARCHAR(MAX) values is interesting, but it’s not terribly practical to list as something that makes “NVARCHAR Everywhere” work, because if you have a MAX datatype, that should be due to having data that is over 8000 bytes. The rows that stay in-row (where choosing NVARCHAR unnecessarily has no compression drawback) probably won’t be enough to really matter.
6. I haven’t researched this for actual impact, but it’s POSSIBLE that it might take up more heap memory (use for function parameters) than necessary.
I think default to NVARCHAR and use VARCHAR when you are 100% certain you won’t need NVARCHAR is fine and reasonable.
But overall, good research 🙂
P.S. Post here OR on the video? How about both? (it wasn’t XOR) 😉
You can post both places, but I’m only going to respond to one of them. Probably why I’m bad at SEO.
1. Integer types differ in that they have a fundamentally different domain, one which is usually clearer than “Will this contain Unicode at any point in the future?” But the funny thing here is that with row compression, all of those integer data types are compressible down to the minimum number of bytes necessary for storage. So if it was *just* about data storage, it wouldn’t be the worst thing ever to pick BIGINT + row compression for everything where you’re in any doubt. But the domain is relevant, and when you declare something a TINYINT, you are signifying a particular range of valid values: that it is logically impossible to expect something outside of this range, and if you try to insert something outside the range, we should get some sort of exception or error. I’d argue that you’re not really doing the same between VARCHAR and NVARCHAR, especially because SQL Server “helpfully” replaces Unicode data with question marks, so you silently lose data rather than receiving an error. I suppose you could write your own string-checking logic to see if somebody inserted Unicode data and throw an exception…but if you’re going that far down the rabbit hole, why not just make it NVARCHAR?
2 and 3. I think the number of cases where this matters is really small, as if I’m reading you correctly, you’d be talking about storing more than 4000 characters of text across columns (ignoring other columns for simplicity’s sake) but less than 8000 characters of text across columns for the table–not just that the data lengths of the VARCHAR/NVARCHAR fields are that long, but that you’re actually *storing* that much and that you will never accept Unicode. These scenarios pop up (mostly in relational OLAP dimensions), but that scenario is uncommon enough that I can admit its truth and downplay its relevance. As far as indexes go, there is a legitimate gripe here in certain text search scenarios: even though Unicode compression would store the NVARCHAR value as the same size on disk as VARCHAR, you can’t insert 1700 ASCII characters into an indexed NVARCHAR column. So there you’d be limited to 850 characters, or 450 if you’re crazy enough to make a long string value the clustered index.
4. I’m not sure about this, but you’re likely right in your conjecture. And that would be a knock against NVARCHAR, one of more relevance than 2 & 3 above. But I also think that the answer to this is “get your character lengths right” rather than “Use VARCHAR.” Declared column size plays a much bigger role in memory grants than the data type itself.
5. I don’t recommend NVARCHAR(MAX) when you expect small values for the same reason you don’t (like, say, the reasoning in point 4), though I did want to explain that unclear little comment in the note on Unicode compression that “Unicode compression is not supported for nvarchar(max) data even if it is stored in row. However, this data type can still benefit from page compression.” So that was maybe a bit of a tangent. Not that I’ve ever gone off on a tangent in my life.
6. No idea on that one, but I’d wonder if, assuming it to be true, it really moves the needle much.
On your follow-up regarding implicit conversion, point taken, though I did literally point out the (bad) default collation that I chose to use for this demo. I’d estimate that the vast majority of SQL Server databases and string columns use SQL collations instead of Windows collations, but you are right that if a developer is using Windows collations, that issue won’t really matter for performance. But the difference in how Windows collations handle this slipped my mind, which is a fancy way of me saying that until you pointed it out, I wasn’t going to remember that I should have known that because I did read your blog post when it came out.
Thanks for your thoughts. I appreciate the opportunity to work through the concept with you.
Re: 2 posts for the cost of 1 reply: that’s fine. It would get too messy to duplicate this thread in both places anyway. I just wanted to have at least some counter-point noted for those that found the video and weren’t going to come here to see if there were any replies.
1a. I dunno, I can see a lot of similarity in the issue of “domain”, and the correlation between a range of numeric values: 0 – 255 for TINYINT vs -32768 to 32767 for SMALLINT, etc, and a range of acceptable characters (i.e. character set): ASCII vs a particular code page vs Unicode. If you were thinking that numeric range was more akin to max string size (i.e. NVARCHAR(50) vs NVARCHAR(100)) and thus character set wasn’t applicable, I suppose it’s possible to start with an integer type and someday the business requirements change such that you then need a DECIMAL or FLOAT type, or going from DATETIME or DATETIME2 to DATETIMEOFFSET.
1b. Two things regarding silent data loss potential from NVARCHAR to VARCHAR:
1b(i). To be fair, there are three ways of handling invalid conversions: replace with “Best Fit” match (if one exists), replace with default character (typically “?”), or error. While .NET allows you to choose which of these you want to do, SQL Server has decided for us that it will attempt a “Best Fit” mapping, and if one does not exist, then use the default replacement character (i.e. “?”). This is just FYI for those who are only familiar with SQL Server’s behavior.
1b(ii). It’s not just NVARCHAR -> VARCHAR that can result in silent data loss:
DECLARE @DateTime DATETIME,
        @DateTimeOffset DATETIMEOFFSET = '2020-11-22T12:34:56.1234567+12:34';
SET @DateTime = @DateTimeOffset;
SELECT @DateTimeOffset AS [@DateTimeOffset], @DateTime AS [@DateTime];
-- 2020-11-22 12:34:56.1234567 +12:34    2020-11-22 12:34:56.123

DECLARE @Int INT,
        @Float FLOAT = 12.345;
SET @Int = @Float;
SELECT @Float AS [@Float], @Int AS [@Int];
-- 12.345    12
2 and 3. I am referring to data modelling and having multiple columns to place in a table. There is effectively an 8060 byte limit per row for the table, and as of SQL Server 2016, 1700 for NonClustered indexes. If I have four 500 “character” columns, plus several others, that will reserve 2000 bytes if declared as VARCHAR, but 4000 if declared as NVARCHAR. Still well under 8060, but it adds up quickly and will sometimes make a difference. For indexes, I can create an index with 3 of those 500 character columns if VARCHAR, but with only 1 of them if they are NVARCHAR. I was anticipating scenarios with multiple columns (composite indexes) rather than a single, longish column.
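A quick sketch of that index-size math (table and index names are made up for illustration):

```sql
-- Two 500-character columns: declared key size is 2 x 500 x 2 bytes =
-- 2000 bytes as NVARCHAR, which exceeds the 1700-byte nonclustered
-- index key limit. The VARCHAR equivalent would be 1000 bytes and fine.
CREATE TABLE dbo.WideKeys
(
    ColA NVARCHAR(500) NOT NULL,
    ColB NVARCHAR(500) NOT NULL
);
-- SQL Server raises a warning at creation time, and any insert whose
-- actual key length exceeds the limit will fail outright.
CREATE INDEX IX_WideKeys ON dbo.WideKeys (ColA, ColB);
```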
4. For memory grants, isn’t the estimate based on 50% of variable length columns, in bytes? So this is really very similar to the max row size, max index size issue if you are “reserving” twice as many bytes as you need, since SQL Server can’t assume that the actual data is guaranteed to compress.
5. For heap memory, and also for memory grants: I’m not sure exactly how impacting either of these are. I mention them as this is an exercise in potential total impact, so I’m trying to think of any area that can possibly be affected by such a decision.
I’m not saying that any single point here means that such decisions can’t be made. The question is: is this a good / viable practice _in general_. And in that context, I still maintain that, at least for me, it’s not an approach that I would take or promote. This doesn’t mean I believe it will always, or even most of the time, lead to failure. I just think that, at the very least, some additional nuance is required to distinguish between scenarios where this decision could possibly have a noticeable negative impact, and scenarios where the possibility of negative impact is so low as to be impractical to worry about it. Meaning, I think size of the project (data footprint and transactional volume) as well as how many columns are questionable in the first place, need to be considered. If all tables will have less than a million rows, or if only 1% (or less) of columns would likely be VARCHAR, then “NVARCHAR Everywhere” might could work. But then that just sounds like “Default to NVARCHAR, do VARCHAR when sure you don’t need Unicode”. 😉
Re: collations, implicit conversions, and performance: Sorry, I didn’t see / hear mention of bad default collation. Majority of DBs _in the US_, quite likely, but not sure if that would extend to majority in general. It’s only the US English locale that’s honored by having a SQL Server collation as its default (you can see the full list here: https://docs.microsoft.com/en-us/sql/relational-databases/collations/collation-and-unicode-support#Server-level-collations ). And no worries about not remembering something I posted years ago. I’m aware that most folks aren’t as obsessed with collations / Unicode as I am, so I was just taking the opportunity to summarize the issue for those that weren’t going to read the details.
I forgot to mention:
The suggested benefit of “NVARCHAR Everywhere avoids implicit conversions…Implicit conversion can be a nasty performance impediment.” is an extremely common misunderstanding. True, there is a specific scenario in which index seeks won’t be possible, but it’s not due to the “implicit conversion”, and it’s avoidable even with an implicit conversion.
The scenario is having an index with at least one key column that is a) VARCHAR, b) using a SQL Server collation (i.e. name starting with “SQL_”), and c) compared to an NVARCHAR literal, variable, or column. The problem is that the SQL Server collations sort VARCHAR data differently than NVARCHAR data. So, if the index is based on the VARCHAR ordering, then the rows won’t be in the correct order when the predicate becomes NVARCHAR due to datatype precedence. Thus no seeky-seeky ;-).
But, Windows collations (i.e. name _not_ prefixed with “SQL_”) sort both VARCHAR and NVARCHAR the same way. In these cases, even with the index being on the VARCHAR column, the rows are in the correct order even after the implicit conversion. And so you can still get the index seek.
Moral of the story: use Windows collations and not the SQL Server collations (which are really only to offer pre-SQL Server 2000 behavior for backwards compatibility).
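To illustrate (the table, columns, and index names are made up; the collations are the point):

```sql
-- Same data, two collations. With a SQL_ collation, comparing a VARCHAR
-- index key against an NVARCHAR predicate forces a scan, because the
-- index order no longer matches NVARCHAR sort order after datatype
-- precedence converts the column. With a Windows collation, the orders
-- agree and the seek survives the implicit conversion.
CREATE TABLE dbo.Accounts
(
    AccountCode_SQL VARCHAR(20) COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL,
    AccountCode_Win VARCHAR(20) COLLATE Latin1_General_100_CI_AS NOT NULL
);
CREATE INDEX IX_Accounts_SQL ON dbo.Accounts (AccountCode_SQL);
CREATE INDEX IX_Accounts_Win ON dbo.Accounts (AccountCode_Win);
GO
-- Both predicates trigger an implicit VARCHAR -> NVARCHAR conversion,
-- but only the SQL_ collation loses the ability to seek.
SELECT COUNT(*) FROM dbo.Accounts WHERE AccountCode_SQL = N'ABC123'; -- scan
SELECT COUNT(*) FROM dbo.Accounts WHERE AccountCode_Win = N'ABC123'; -- seek
```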
For details, please see the following post of mine:
Impact on Indexes When Mixing VARCHAR and NVARCHAR Types ( https://sqlquantumleap.com/2017/07/10/impact-on-indexes-when-mixing-varchar-and-nvarchar-types/ )