Collation Hell (Part 2)

Sun Aug 16, 2009 by Dan Guzman

In my last post, I discussed why one should avoid a mixed collation environment and how to choose the right collation for your environment. This post focuses on planning a collation change.

Should You Change Existing Collations?

Once you choose a standard collation (or at least a preferred one) for your organization, you'll need to decide if the change to existing instances, databases and columns is worth the effort and risk. Keep in mind that the effort involves not only the actual collation change but also testing along with possible changes to code and data to maintain the desired behavior. Such a remediation project can be quite significant depending on the old/new collation and scope of the change so you need to weigh the pros and cons to determine if the effort is justified.

Note that changing collations need not be an all-or-none decision; you might choose to convert only some (or none) of your existing instances/databases while enforcing the collation standard for new installations. You can identify the instances that are causing the most grief and weigh those accordingly.

A number of factors influence the effort and risk of a collation change. A change to language, sensitivity and/or code page is often more complex than a conversion from a SQL collation to a Windows collation (or Windows to SQL) of the same language and sensitivity. Let me discuss these scenarios in more detail so that you can better ascertain the effort and risk involved in your environment for planning purposes.

Windows vs. SQL Collation Change

A conversion between a SQL and Windows collation of the same language, sensitivity and code page ought to be fairly straightforward due to the same character set and similar comparison rules. As with any collation change, there are differences in behavior though. The main difference here is that Windows collations use word sort behavior so slightly different sorting/comparison behavior will result. The script below shows such a difference with identical data

--SQL collation: compares greater than

IF 'coop' COLLATE SQL_Latin1_General_CP1_CI_AS < 'co-op' COLLATE SQL_Latin1_General_CP1_CI_AS

PRINT 'less than'

ELSE IF 'coop' COLLATE SQL_Latin1_General_CP1_CI_AS = 'co-op' COLLATE SQL_Latin1_General_CP1_CI_AS

PRINT 'equal'

ELSE IF 'coop' COLLATE SQL_Latin1_General_CP1_CI_AS > 'co-op' COLLATE SQL_Latin1_General_CP1_CI_AS

PRINT 'greater than'

ELSE PRINT 'UNKNOWN'

--Windows collation: compares less than

IF 'coop' COLLATE Latin1_General_CI_AS < 'co-op' COLLATE Latin1_General_CI_AS

PRINT 'less than'

ELSE IF 'coop' COLLATE Latin1_General_CI_AS = 'co-op' COLLATE Latin1_General_CI_AS

PRINT 'equal'

ELSE IF 'coop' COLLATE Latin1_General_CI_AS > 'co-op' COLLATE Latin1_General_CI_AS

PRINT 'greater than'

ELSE PRINT 'UNKNOWN'

All things being equal, a conversion from/to a Windows collation will likely require few changes, if any, to code and schema (besides the collation change). On the other hand, converting to a collation of different sensitivity and/or character set is often be more challenging

Sensitivity Change

You might recall that the instance collation determines the sensitivity for variable names and labels while the database collation determines sensitivity of identifiers and literals. I always match characters exactly in variable names, labels identifiers (including table aliases) regardless of whether I'm using a sensitive or insensitive collation and never use names that differ only by case. Not only does naming consistency make code cleaner, this practice facilitates moving between collations. However, it is unlikely that all database developers were so anal in their naming so be aware that you'll probably need to make code or schema changes in order to convert between collations of different sensitivity.

A change from a case-sensitive collation to a case-insensitive one is usually minor, at least from a code perspective. The same schema/code that runs in a case-sensitive environment will run in a case-insensitive collation as long as you don't encounter names and identifiers in the same scope that differ only by case (e.g. @customerID and @CustomerID). Such a deliberate practice is uncommon in my experience but these conflicts must be addressed before changing to a case-insensitive collation.

One usually strives to store and query data using a consistent case (especially all upper/lower) under a case-sensitive collation. If this practice was not followed, data that was unique under a case-sensitive collation will not be regarded as such under case-insensitive rules and prevent unique indexes (including primary key or unique constraints) from begin created. This might actually be a good thing when the real issue is bad data (i.e. duplicates inadvertently allowed due to inconsistent case). However, you may need to deviate from the case-insensitive standard at the column level in some situations due to business requirements, such as to enforce uniqueness of case-sensitive part numbers.

Going from a case-insensitive to a case-sensitive or binary collation (which I don't personally recommend) will typically require more changes. Developers tend to be a bit sloppy with matching case under a case-insensitive collation because there is no requirement to do so. Don't be surprised if a lot of code and queries must be changed once variables and identifiers become case sensitive. Furthermore, you may need to update data to a consistent case and also make application changes to ensure data are stored in a consistent case.

The considerations that apply to case sensitivity also apply to other collation sensitivity options (accent, Kana and width). I wouldn't expect as many issues compared to a change in case sensitivity in most cases, though.

Character Set Change

A change in code page is a non-issue when char/varchar/text data contains only ASCII characters. If you have a character outside the ASCII range (0-127, 0x00-0x7F), a code page change will present a problem when the character doesn't also exist in the target collation's code page. Such a character will instead be mapped to an alternate character (e.g. 'À' to 'A' in example below) or the catch-all '?' (e.g. '€' to '?' in example below). If this mapping is unacceptable, you'll need to change the data type to Unicode (nchar/nvarchar/ntext) or update data to conform to the target code page.

CREATE TABLE dbo.Foo(

Bar char(1) COLLATE Latin1_General_CI_AS

);

INSERT INTO dbo.Foo (Bar) VALUES('A');

INSERT INTO dbo.Foo (Bar) VALUES('À');

INSERT INTO dbo.Foo (Bar) VALUES('€');

--list values not mapped identically

SELECT Bar AS OriginalValue, Bar COLLATE Japanese_90_BIN AS MappedValue

FROM dbo.Foo

WHERE

CAST(CAST(Bar AS nvarchar(MAX)) AS varbinary(MAX)) <>

CAST(CAST(Bar COLLATE Japanese_90_BIN AS nvarchar(MAX)) AS varbinary(MAX));

OriginalValue	MappedValue
À	A
€	?

If you are unsure if you have problem characters, the above script shows one method to identify these. This script converts the original collation characters to Unicode and then to varbinary and repeats the technique for the target collation. An inequality of the two values indicates an inexact mapping that may require remediation.

Language Change

I'm sure some of you have inherited different language collations due to mergers and acquisitions or inattention to detail during installation. Be mindful that the topic of supporting multiple languages/locales is much larger than just collation. I'm only discussing a collation language change here but if you need to fully support multiple languages in a single database, you must also consider other factors such as a schema that supports multiple translations, currency and UOM conversion and applications that are sensitive to client locale.

You may experience different behavior after a collation language change due to the different sorting and comparison semantics. The script below illustrates such a difference. Even if you chose a collation that supports the majority of your users' languages, that collation might be less than ideal for the user minority. Consider performing some operations in application code instead of SQL Server when the standard collation behavior is unacceptable for the task at hand.

--returns both 'Schröder' and 'Schroeder'

DECLARE @Foo TABLE(

LastName nvarchar(10) COLLATE German_PhoneBook_CI_AS);

INSERT INTO @Foo VALUES(N'Schröder');

INSERT INTO @Foo VALUES(N'Schroeder');

SELECT LastName FROM @Foo

WHERE LastName LIKE N'%oe%';

--returns only 'Schroeder'

DECLARE @Foo TABLE(

LastName nvarchar(10) COLLATE Latin1_General_CI_AS);

INSERT INTO @Foo VALUES(N'Schröder');

INSERT INTO @Foo VALUES(N'Schroeder');

SELECT LastName FROM @Foo

WHERE LastName LIKE N'%oe%';

Summary

A collation change effort varies considerably depending on the size and complexity of the environment. Perform due diligence before embarking on a collation change. I don't want to discourage anyone from changing collations but as much as a mixed collation environment is a pain, a botched remediation project is even worse. Be sure to plan accordingly.

I'll share different methods to change collations in my last post of this series.

Legacy Comments


Henri 2009-08-18	re: Collation Hell (Part 2) Collation is something you always have to reckon with. Which collation is choosen is not very relevant if the database is going to exist in more than one place, just make sure you alway take collation into account with temp tables. Change collation is a bitch. Never try to change it on a production database. How about the sharp S (which is not displayed in my comment) :-) :-) :-)

avat 2009-11-05	re: Collation Hell (Part 2) Thanks a lot! Although my problem was not among the ones described here, this page was very helpful in diagnosing it. And +1 for the title :-)

Dan Guzman 2010-01-26	re: Collation Hell (Part 2) Below is an example script that changes the collation of the AdventureWorks Person.Address columns. To perform the change, indexes referencing the altered columns are dropped beforehand and recreated after the collation change. DROP INDEX Person.Address.IX_Address_AddressLine1_AddressLine2_City_StateProvinceID_PostalCode; ALTER TABLE Person.Address ALTER COLUMN AddressLine1 nvarchar(60) COLLATE Latin1_General_CS_AS NOT NULL; ALTER TABLE Person.Address ALTER COLUMN AddressLine2 nvarchar(60) COLLATE Latin1_General_CS_AS NULL; ALTER TABLE Person.Address ALTER COLUMN City nvarchar(30) COLLATE Latin1_General_CS_AS NOT NULL; ALTER TABLE Person.Address ALTER COLUMN PostalCode nvarchar(15) COLLATE Latin1_General_CS_AS NOT NULL; CREATE UNIQUE NONCLUSTERED INDEX IX_Address_AddressLine1_AddressLine2_City_StateProvinceID_PostalCode ON Person.Address ( AddressLine1, AddressLine2, City, StateProvinceID, PostalCode );

Lisa 2012-02-28	re: Collation Hell (Part 2) WE8ISOxxx 8bit.Admittedly this exmlpae going from a 7 to an 8 may not give you any difficulties, as there will be no Euro signs or umlauts for instance and they are correlated, but if you had a 7bit character set (that was a subset of a different 8bit character set) then you changed to the above mentioned (unrelated) 8bit character set you may well have unexpected (and wrong) character translations when that data is accessed. Just thinking out loud.A safer way surely would be to export all the data using a bulk unload utility, change the character set of the database, then re-import the data so it gets properly converted by the inbuilt from character set to character set routines.Is there not an export and import data utility for MySQL ?