Collation Hell (Part 1)

Sun Jul 26, 2009 by Dan Guzman

I inherited a mixed collation environment with more collations than I can count on one hand. The different collations require workarounds to avoid "cannot resolve collation conflict" errors and those workarounds kill performance due to non-sargable expressions. Dealing with mixed collations is a real pain so I strongly recommend you standardize on a single collation and deviate only after careful forethought. Here's a brief overview of collations and some guidance to help you choose the right collation for your organization and new SQL installations.

Collation Overview

A collation determines the rules SQL Server uses to compare and sort character data. These rules are language/locale aware and may also be sensitive to case, accent, Kana and width. Collation suffixes identify dictionary rule (in)sensitivity: _CS (case sensitive), _CI (case insensitive), _AS (accent sensitive), _AI (accent insensitive) and _KS (Kana sensitive). Binary collations, identified by suffixes _BIN (binary) and _BIN2 (binary-code point), are sensitive in all regards.

A collation determines which characters can be stored in non-Unicode character data types and the bit patterns used for storage. Char, varchar and text data types can store only 256 different characters due to the single byte limitation. The first 128 characters (0-127, 0x00-0x7F) are the same for all collations as defined by the ASCII character set and the remaining 128 characters (128-255, 0x80-0xFF) vary according to the code page associated with the collation. Characters without an associated code point are mapped to an either an alternate character or to the catch-all '?' character.

Collations are grouped into Windows and SQL collations. Windows collations provide sorting and comparison behavior consistent with applications running on a computer with the corresponding Windows operating system locale. Windows collation also provide consistent behavior for both Unicode and non-Unicode data types.

SQL collations use different rules for non-Unicode and Unicode types. SQL Server collations, identified with the SQL_ collation name prefix, use the character set and sort order settings from older SQL Server versions for non-Unicode types and are provided specifically to maintain compatibility with existing SQL Server installations. Both SQL and Windows collations use the same rules for Unicode types.

Specifying a Collation

Collation can be specified at the instance, database, column and expression level. The SQL Server instance collation is determined during SQL Server installation and cannot be changed without a reinstall/rebuild. It's a good idea to get the collation right the first time unless you need practice re-installing SQL Server. Keep in mind that the instance collation determines the collation (including case-sensitivity) of Instance-level objects like logins and database names as well as identifiers for variables, GOTO labels and temporary tables. Passwords are always case-sensitive in SQL 2005 and above, although collation determined password sensitivity in earlier versions.

The database collation is determined when the database is created. If not specified otherwise, the instance default collation is used as the database collation. Database-level identifiers like table and column names use the database collation as do literal expressions. The database collation can be changed at any time but this does not change the collation of existing table columns.

Column collation for character data is specified when the table is created or when the column added to the table. If not specified otherwise, the database collation is used. A column's collation can be changed only by altering the column with the new collation or recreating the table with the new collation specified on the column definition. If you want a column's collation to remain different than the database default collation, you must be careful to explicitly specify the collation whenever the column is altered so that it not inadvertently changed to the database default collation.

Choosing the Right Collation

The default collation that the SQL Server installer chooses is not necessarily the Microsoft recommended one or the one that is best for your environment. SQL Server setup examines the operating system locale and chooses the default as the oldest available version associated with the locale. For example, a SQL Server installation in the US will default to SQL_Latin1_General_CP1_CI_AS and the installation default in the UK will be Latin1_General_CI_AS. In both cases, Microsoft recommends a Windows collation (e.g. Windows Latin1_General_CI_AS) unless one needs to maintain compatibility with existing installations. More on that shortly.

Language is the most important consideration in choosing a collation for a new installation. This is one reason why the SQL Server installer chooses the default collation based on the operating system locale. If all users speak the same language, choose a collation that supports the language/locale. This will help ensure expected sorting and comparison behavior along with alphabet support for non-Unicode types. In a multi-language environment, choose a collation with the best overall support for the languages used.

Another major consideration is collation compatibility. If you have existing SQL installations, consider using the same collation for a new instance if you envision sharing data via replication, SSIS or future server consolidation. I previously mentioned that Microsoft recommends a Windows collation but it may be better to revert to a SQL collation for compatibility with older instances in your environment that already use the SQL collation. Compatibility is another reason why the installation default is SQL_Latin1_General_CP1_CI_AS collation in the US. Unfortunately, this default has the side effect of DBAs unwittingly installing new instances with a SQL collation instead of a Windows collation like Latin1_General_CI_AS even when compatibility isn't needed.

The choice of whether or not to choose a case sensitive collation is a bit subjective. A case insensitive collation is appropriate when you need to query data regardless of the case of the actual data. For example, this allows one to easily find customers with a last name of 'Smith' even when data is not stored in proper case. With a case sensitive collation, it is important that one stores data in a consistent case (not to say that one shouldn't anyway) and this places more burden on application and database developers.

Collation Performance

Collation performance was a bigger deal back in the days of 486 processors (instead of collation, it was actually character set and sort order back then). The comparative performance on modern processors is usually insignificant. SQL collations should provide better performance than Windows collations for non-Unicode types due to simpler comparison rules but the difference is significant only in the most severe circumstances, such as a table scan with LIKE '%Some String%' in the WHERE clause. See Comparing SQL collations to Windows collations. Binary collations are said to provide the best performance but the cost of unnatural (non-dictionary) comparisons and sort order is high; most users would expect 'a' to sort before 'B' but that is not the case with binary collations.

I personally don't think performance should even be considered in choosing the proper collation. One of the reasons I'm living in collation hell is that my predecessors chose binary collations to eke out every bit of performance for our highly transactional OLTP systems. With the sole exception of a leading wildcard table scan search, I've found no measurable performance difference with our different collations. The real key to performance is query and index tuning rather than collation. If performance is important to you, I recommend you perform a performance test with your actual application queries before you choose a collation on based on performance expectations.

Summary

My general recommendation is that you should use a case insensitive Windows collation appropriate for your locale unless you need to maintain compatibility with existing SQL instances or have special considerations. In my next post, I'll discuss changing collations so that you can avoid a mixed collation environment and show different methods to accomplish the task.

Legacy Comments


Uri Dimant 2009-07-27	re: Collation Hell (Part 1) Hi Dan There is great series of blogs you have been started. One of our clients has faced the problem where a server collation differes with a database collation. The problem arised when the developer has stoeded the unicode (hebrew) in temporary table which is a system database..:-)

Younes Abesi 2009-07-31	re: Collation Hell (Part 1) Hi Dan, Do you have recommendation for adding internationalization support to current web based ASP/ASPET/SQL Server 2005 application. Supporting collations is seemed to take 90% of the scope of the project while it should not.

Dan Guzman 2009-08-01	re: Collation Hell (Part 1) It's reasonably easy to support internationalization on the database side. Use Unicode for all varchar and char columns that have multilingual content so that the character set is not a concern. Develop a schema that supports multiple translations of the same data when needed (e.g. product descriptions). Parameterize all queries so that date formats and decimal separators are not an issue. Most of the burden then moves to the client application code where one can perform sorting and string comparisons using the user's culture. Frankly, it's very hard and cost-ineffective to develop applications that support every possible locale and browser and perform well too. There are compromises one must make and those should be driven by business needs. Sometimes, close is good enough. I'm not a web developer but my understanding is that globalization support in classic ASP was extremely limited compared with ASP.NET, which introduced features like the CultureInfo class to facilitate culture aware string comparisons and sorting as well as presentation formatting.

sean redmond 2009-08-03	re: Collation Hell (Part 1) Collation, Unicode and Diskspace: Fond as I am of Unicode and the ability to store many scripts without hassle, I do have to wonder whether Unicode for all text fields is a waste of space. Unicode fields are twice the size of standard text-fields and text-fields can occupy a large proportion of a database. If one is not using different scripts, does the convenience of Unicode make up for the significant increase in size that comes with Unicode textfields? I do admit that mixed collations are a pain and can be very time-consuming. Would it be worth the author's while, I wonder, to convert all of the textfields to Unicode and then back to one collation? Needless to say, there would have to be stringent checks to see that no textdata was altered.

Richard 2012-03-24	re: Collation Hell (Part 1) Any recommendations on collation for an nvarchar column that will store both English and Asian words please?