Why does casting a UTF-8 VARCHAR column to XML require converting to NVARCHAR and encoding change?

Question

I am trying to convert data in a varchar column to XML but I was getting errors with certain characters. Running this ...

-- This fails
DECLARE @Data VARCHAR(1000) = '<?xml version="1.0" encoding="utf-8"?><NewDataSet>Test¦</NewDataSet>';
SELECT CAST(@Data AS XML) AS DataXml

... results in the following error

Msg 9420, Level 16, State 1, Line 3
XML parsing: line 1, character 55, illegal xml character

It appears that it's the broken pipe character that is causing the error but I thought that it was a valid character for UTF-8. Looking at the XML spec it appears to be valid.

When I change it to this ...

-- This works
DECLARE @Data VARCHAR(1000) = '<?xml version="1.0" encoding="utf-8"?><NewDataSet>Test¦</NewDataSet>';
SELECT CAST(REPLACE(CAST(@Data AS NVARCHAR(MAX)), 'encoding="utf-8"', '') AS XML) AS DataXml

... it works without error (replacing encoding string to utf-16 also works). I'm using SQL Server 2008 R2 with SQL_Latin1_General_CP1_CI_AS Coallation.

Can anyone tell my why I need to convert to NVARCHAR and strip the encoding="utf-8" for this to work?

Thanks,

Edit

It appears that this also works ...

DECLARE @Data VARCHAR(1000) = '<?xml version="1.0" encoding="utf-8"?><NewDataSet>Test¦</NewDataSet>';
SELECT CAST(REPLACE(@Data, 'encoding="utf-8"', '') AS XML) AS DataXml

Removing the utf-8 encoding from the prolog is sufficient for SQL Server to do the conversion.

SQL Server doesn't support UTF-8 - that's why you need to convert to `NVARCHAR` (which is UCS-2 = UTF-16) and then to XML — marc_s, Aug 05 '14 at 11:27
Thanks @marc_s. I didn't realize that UTF-8 wasn't supported. In that case is it considered bad practice to store UTF-8 encoded XML in a varchar column? The data is just being logged and not read / manipulated / written to frequently. — Gavin Sutherland, Aug 05 '14 at 12:08
Why not store XML in an `XML` column instead of a `VARCHAR` column? — Remy Lebeau, Aug 07 '14 at 02:09
Hi @RemyLebeau. The log column is going to be used for more than xml so it can't be xml type. I'm using SQL 2008 ... isn't that only available from 2012 onwards? — Gavin Sutherland, Aug 07 '14 at 14:10
@GavinSutherland: the `XML` data type was introduced in SQLServer 2005. — Remy Lebeau, Aug 07 '14 at 14:50
@RemyLebeau. You're right. Sorry .. the MSDN page I looked at was only showing documentation back to 2012. I need more sleep ... — Gavin Sutherland, Aug 07 '14 at 15:14
@GavinSutherland: Microsoft has a bad habit of removing old documentation, and references to old software versions, from MSDN when old OS/app versions are no longer supported by Microsoft. They don't seem to realize that USERS still need to support old versions and MSDN's practice of removing old docs/references can be a hindrance to that effort. — Remy Lebeau, Aug 07 '14 at 15:36
Hi @RemyLebeau and Gavin. I realize that this is just over 5 years old now, but the accepted answer is incorrect and misleading. Not sure if you have had a chance to review the answer that I posted recently, but it explains what is actually happening. — Solomon Rutzky, Jan 06 '20 at 00:18
Hi @SolomonRutzky. Sorry, I've just not had a chance to get round to looking at this. I'll try and get round to it this week. — Gavin Sutherland, Jan 06 '20 at 12:49

score 3 · Answer 1 · answered Aug 07 '14 at 02:08

3

Your pipe character is using Unicode codepoint U+00A6 BROKEN BAR instead of U+007C VERTICAL LINE. U+00A6 is outside of ASCII. VARCHAR does not support non-ASCII characters. That is why you have to use NVARCHAR instead, which is designed to handle Unicode data.

answered Aug 07 '14 at 02:08

Remy Lebeau

555,201
31
458
770

I now understand that `U+00A6 BROKEN BAR` is a Unicode codepoint but just to fully understand why my last example worked (the one after the edit with no nvarchar conversion) ... is this because the SQL_Latin1_General_CP1_CI_AS collation code page [Windows 1252](http://msdn.microsoft.com/en-us/goglobal/cc305145.aspx) handles a limited set of Unicode characters (of which broken bar is one)? – Gavin Sutherland Aug 07 '14 at 15:46
1

Latin-1 (aka ISO-8859-1) and Win1252 are not the same charset. Win1252 is commonly mislabeled/misused as Latin-1 due to their similarities. However, U+00A6 is included in both charsets. – Remy Lebeau Aug 07 '14 at 16:42

Solomon Rutzky · Accepted Answer · 2020-01-06T16:25:52.060

Remy's answer is, unfortunately, incorrect. VARCHAR absolutely does support Extended ASCII. Standard ASCII is only the first 128 values (0x00 - 0x7F). That happens to be the same for all code pages (i.e. 8-bit VARCHAR data) and UTF-16 (i.e. 16-bit NVARCHAR data) in SQL Server. Extended ASCII covers the remaining 128 of the 256 total values (0x80 - 0xFF). These 128 values / code points differ per code page, though there is a lot of overlap between some of them.

Remy states that VARCHAR does not support U+00A6 BROKEN BAR. This is easily disproven by simply adding SELECT @Data; after the first line:

DECLARE @Data VARCHAR(1000) =
                 '<?xml version="1.0" encoding="utf-8"?><NewDataSet>Test¦</NewDataSet>';
SELECT @Data;

That returns:

<?xml version="1.0" encoding="utf-8"?><NewDataSet>Test¦</NewDataSet>

The ¦ character is clearly supported, so the problem must be something else.

It appears that it's the broken pipe character that is causing the error but I thought that it was a valid character for UTF-8.

The broken pipe character is a valid character in UTF-8. The problem is: you aren't passing in UTF-8 data. Yes, you state that the encoding is UTF-8 in the xml declaration, but that doesn't mean that the data is UTF-8, it merely sets the expectation that it needs to be UTF-8.

You are converting a VARCHAR literal into XML. Your database's default collation is SQL_Latin1_General_CP1_CI_AS which uses the Windows-1252 code page for VARCHAR data. This means that the broken vertical bar character has a value of 166 or 0xA6. Well, 0xA6 is not a valid UTF-8 encoded anything. If you were truly passing in UTF-8 encoded data, then that broken vertical bar character would be two bytes: 0xC2 and then 0xA6. If we add that 0xC2 byte to the original input value (the 0xA6 is the same, so we can keep that where it is), we get:

DECLARE @Data VARCHAR(1000) = '<?xml version="1.0" encoding="utf-8"?><NewDataSet>Test'
                              + CHAR(0xC2) + '¦</NewDataSet>';
SELECT @Data AS [@Data];
SELECT CAST(@Data AS XML) AS [DataXml];

and that returns:

<?xml version="1.0" encoding="utf-8"?><NewDataSet>TestÂ¦</NewDataSet>

followed by:

<NewDataSet>Test¦</NewDataSet>

This is why removing the encoding="utf-8" fixed the problem:

with it there, the bytes of that string needed to actually be UTF-8 but they weren't, and ...
with it removed, the encoding is assumed to be the same as the string itself, which is VARCHAR, and that means the encoding is the code page associated with the collation of the string, and a VARCHAR literal or variable uses the database's default collation. Meaning, in this context, either without the encoding="xxxxxx", or with encoding="Windows-1252", the bytes will need to be encoded as Windows-1252, and indeed they are.

Putting this all together, we get:

If you have an actual UTF-8 encoded string, then it can be passed into the XML datatype, but you need to have:
1. no upper-case "N" prefixing the string literal, and no NVARCHAR variable or column being used to contain the string
2. the XML declaration stating that the encoding is UTF-8
If you have a string encoded in the code page that is associated with the database's default collation, then you need to have:
1. no upper-case "N" prefixing the string literal, and no NVARCHAR variable or column being used to contain the string
2. either no "encoding" as part of an <?xml ?> declaration, or have encoding set to the code page associated with the database's default collation (e.g. Windows-1252 for code page 1252)
If your string is already Unicode, then you need to:
1. prefix a string literal with an upper-case "N" or use an NVARCHAR variable or column for the incoming XML
2. have either no "encoding" as part of an <?xml ?> declaration, or have encoding set to "utf-16"

Please see my answer to "Converting accented characters in varchar() to XML causing “illegal XML character”" for more details on this.

And, just to have it stated: while SQL Server 2019 introduced native support for UTF-8 in VARCHAR literals, variables, and columns, that has no impact on what is being discussed in this answer.

For info on collations, character encoding, etc, please visit: Collations Info

Great answer. Thanks. – Gavin Sutherland Jan 06 '20 at 12:56 — Gavin Sutherland, Jan 06 '20 at 12:56

Why does casting a UTF-8 VARCHAR column to XML require converting to NVARCHAR and encoding change?

2 Answers2

Linked