Remy's answer is, unfortunately, incorrect. VARCHAR absolutely does support Extended ASCII. Standard ASCII is only the first 128 values (0x00 - 0x7F). That range happens to be the same for all code pages (i.e. 8-bit VARCHAR data) and for UTF-16 (i.e. 16-bit NVARCHAR data) in SQL Server. Extended ASCII covers the remaining 128 of the 256 total values (0x80 - 0xFF). Those 128 values / code points differ per code page, though there is a lot of overlap between some of them.
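For example, in a database whose default collation uses code page 1252 (such as SQL_Latin1_General_CP1_CI_AS), you can see that the broken bar sits in that upper, code-page-specific range (a quick sketch, assuming a 1252 database):

-- 0xA6 / 166 is above the 0x7F boundary, so it belongs to the Extended ASCII
-- range and its meaning depends on the code page; under 1252 it is the broken bar:
SELECT ASCII('¦')    AS [ByteValue],      -- 166 (0xA6)
       CHAR(0xA6)    AS [CharacterBack],  -- ¦
       UNICODE(N'¦') AS [CodePoint];      -- 166 (U+00A6)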
Remy states that VARCHAR does not support U+00A6 BROKEN BAR. This is easily disproven by simply adding SELECT @Data; after the first line:
DECLARE @Data VARCHAR(1000) =
'<?xml version="1.0" encoding="utf-8"?><NewDataSet>Test¦</NewDataSet>';
SELECT @Data;
That returns:
<?xml version="1.0" encoding="utf-8"?><NewDataSet>Test¦</NewDataSet>
The ¦ character is clearly supported, so the problem must be something else.
> It appears that it's the broken pipe character that is causing the error, but I thought that it was a valid character for UTF-8.
The broken pipe character is a valid character in UTF-8. The problem is: you aren't passing in UTF-8 data. Yes, you state that the encoding is UTF-8 in the XML declaration, but that doesn't mean that the data is UTF-8; it merely sets the expectation that it needs to be UTF-8.
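For reference, the failing conversion is easy to reproduce (a minimal sketch; the exact error text can vary by version, but it complains about an illegal XML character):

DECLARE @Data VARCHAR(1000) =
'<?xml version="1.0" encoding="utf-8"?><NewDataSet>Test¦</NewDataSet>';
-- The declaration promises UTF-8, but the single byte used here for ¦ is not
-- a valid UTF-8 sequence, so the conversion fails with an XML parsing error:
SELECT CAST(@Data AS XML);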
You are converting a VARCHAR literal into XML. Your database's default collation is SQL_Latin1_General_CP1_CI_AS, which uses the Windows-1252 code page for VARCHAR data. This means that the broken vertical bar character has a value of 166 or 0xA6. Well, 0xA6 is not a valid UTF-8 encoded anything. If you were truly passing in UTF-8 encoded data, then that broken vertical bar character would be two bytes: 0xC2 and then 0xA6. If we add that 0xC2 byte to the original input value (the 0xA6 is the same, so we can keep that where it is), we get:
DECLARE @Data VARCHAR(1000) = '<?xml version="1.0" encoding="utf-8"?><NewDataSet>Test'
+ CHAR(0xC2) + '¦</NewDataSet>';
SELECT @Data AS [@Data];
SELECT CAST(@Data AS XML) AS [DataXml];
and that returns:
<?xml version="1.0" encoding="utf-8"?><NewDataSet>TestÂ¦</NewDataSet>
followed by:
<NewDataSet>Test¦</NewDataSet>
This is why removing the encoding="utf-8" fixed the problem:
- with it there, the bytes of that string needed to actually be UTF-8, but they weren't, and ...
- with it removed, the encoding is assumed to be the same as the string itself, which is VARCHAR, and that means the encoding is the code page associated with the collation of the string; a VARCHAR literal or variable uses the database's default collation. Meaning, in this context, either without the encoding="xxxxxx" attribute, or with encoding="Windows-1252", the bytes need to be encoded as Windows-1252, and indeed they are (see the sketch after this list).
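Both working variants can be demonstrated with the original single-byte ¦ (a sketch, assuming the database's default collation uses code page 1252):

-- No encoding attribute: the bytes are interpreted using the code page of the
-- string's collation (here, Windows-1252), so the single-byte ¦ is fine:
DECLARE @NoEncoding VARCHAR(1000) =
'<?xml version="1.0"?><NewDataSet>Test¦</NewDataSet>';
SELECT CAST(@NoEncoding AS XML) AS [NoEncoding];

-- Explicit encoding="Windows-1252": same result, since that is how the bytes
-- are actually encoded:
DECLARE @Win1252 VARCHAR(1000) =
'<?xml version="1.0" encoding="Windows-1252"?><NewDataSet>Test¦</NewDataSet>';
SELECT CAST(@Win1252 AS XML) AS [Windows1252];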
Putting this all together, we get:
If you have an actual UTF-8 encoded string, then it can be passed into the XML datatype, but you need to have:
- no upper-case "N" prefixing the string literal, and no NVARCHAR variable or column being used to contain the string
- the XML declaration stating that the encoding is UTF-8
If you have a string encoded in the code page that is associated with the database's default collation, then you need to have:
- no upper-case "N" prefixing the string literal, and no NVARCHAR variable or column being used to contain the string
- either no "encoding" as part of an <?xml ?> declaration, or have encoding set to the code page associated with the database's default collation (e.g. Windows-1252 for code page 1252)
If your string is already Unicode, then you need to:
- prefix a string literal with an upper-case "N" or use an NVARCHAR variable or column for the incoming XML
- have either no "encoding" as part of an <?xml ?> declaration, or have encoding set to "utf-16" (a sketch of this scenario follows below)
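For that last scenario, a minimal sketch using an NVARCHAR literal:

-- N-prefixed literal, so the string is UTF-16; either no encoding attribute
-- or encoding="utf-16" in the declaration works:
DECLARE @Unicode NVARCHAR(1000) =
N'<?xml version="1.0" encoding="utf-16"?><NewDataSet>Test¦</NewDataSet>';
SELECT CAST(@Unicode AS XML) AS [UnicodeXml];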
Please see my answer to "Converting accented characters in varchar() to XML causing “illegal XML character”" for more details on this.
And, just to have it stated: while SQL Server 2019 introduced native support for UTF-8 in VARCHAR literals, variables, and columns, that has no impact on what is being discussed in this answer.
For info on collations, character encoding, etc., please visit: Collations Info