Monday, June 20, 2011

Windows versus Sql collation

There was a question posted by my friend as to which collation to select for a column, whether the “SQL_Latin1_General_Cp1_CI_AS” or the “Latin1_General_CI_AS”.

My first reaction was oops….we normally use the defaults by the SQL server i.e. SQL_Latin1_General_Cp1_CI_AS mostly. The other option is SQL_Latin1_General_Cp1_CS_AS (Case Sensitive) where “ABS” and “abs” are two different values. But trying to find answer to the question I realized that it became a choice between SQL collation or the windows collation.

Here is some dig.
A “collation” specifies how strings are compared and sorted, and what character set is used for non-Unicode data. SQL Server supports two types of collations:
SQL collation: Example: “SQL_Latin1_General_Cp1_CI_AS”
Windows collation: Example: “Latin1_General_CI_AS”

For a Windows collation, a comparison of non-Unicode data is implemented by using the same algorithm as Unicode data. Both Unicode and non-Unicode sorting are compatible with string comparison rules in a particular version of Windows.

In a SQL collation, SQL Server defines different comparison semantics for non-Unicode data.
A SQL collation’s rules for sorting non-Unicode data are incompatible with any sort routine that is provided by the Microsoft Windows operating system; however, the sorting of Unicode data is compatible with a particular version of the Windows sorting rules. Because the comparison rules for non-Unicode and Unicode data are different, when you use a SQL collation you might see different results for comparisons of the same characters, depending on the underlying data type.

Lets do some hands on here.

Example: Create 2 tables namely tblSQLCol and tblWinCol
        CREATE TABLE tblSQLCol (a1 varchar(50), na1 nvarchar(50)) 
        //Here the first column is a non UNICODE data type and second column is a UNICODE data type
[Note: this will create the columns with SQL_Latin1_General_CP1_CI_AS.  This is because the Setup program does not set the instance default collation to the Windows collation Latin1_General_CI_AS if the computer is using the U.S. English locale. Instead, it sets the instance default collation to the SQL collation SQL_Latin1_General_Cp1_CI_AS. So since the server instance is set with SQL.. collation, when creating table, it takes the collation set at server level. Here CP1 means Code Page [1=1252 the default code page], CI means Case insensitive and AS meaning Accent Sensitive]

        CREATE TABLE tblWinCol (a1 varchar(50) COLLATE  Latin1_General_CI_AS,na1 nvarchar(50) COLLATE          
        //Here the first column is a non UNICODE data type and second column is a UNICODE data type
[Note here the columns will be set with the specified collation]

Insert the data into the table in sql collation
INSERT INTO tblSQLCol VALUES(‘a-c’,'a-c’)
Now insert the data into the table with windows collation
INSERT INTO tblWinCol VALUES(‘a-c’,'a-c’)
INSERT INTO tblWinCol VALUES(‘ab’,'ab’)

Now lets see the difference and affects of SQL and windows collation
select a1,’sql’ from tblSQLCol order by 1 asc  //SORT ASCENDING on first column (non Unicode) retrieved using the SQL collation table

select na1,’sql’ from tblSQLCol order by 1 asc //SORT ASCENDING on first column (Unicode) retrieved using the SQL collation table

select a1,’latin’ from tblWinCol order by 1 asc //SORT ASCENDING on first column (non Unicode) retrieved using the Windows collation table

select na1,’latin’ from tblWinCol order by 1 asc //SORT ASCENDING on first column (Unicode) retrieved using the Windows collation table

NOTE: if you see the result, one can observe that for windows collation, in both cases (Unicode and non Unicode) same sorting and comparison logic is used. In SQL collation two different logics are used.

Now lets us see how any sorting algorithm happens at the .NET level.
Lets test the default sorting order
ArrayList arrlst = new ArrayList(2);
RESULT: {1,2,3}
ArrayList arrlst = new ArrayList(2);
RESULT: {“ab,”a-c”}
Now lets us test with the some code which we generally might do at .NET
SqlConnection conn = new SqlConnection(“server=SERVERNAME; user 
DataSet ds = new DataSet();
SqlDataAdapter adapter = new SqlDataAdapter();
adapter.SelectCommand = new SqlCommand(“select a1 from test”, conn);
adapter.Fill(ds, “test”);
DataRow[] rw = ds.Tables[0].Select(“”,”a1 ASC”);
foreach(DataRow r1 in rw){ MessageBox.Show(r1[0].ToString()); }

Same is the result with adapter.SelectCommand = newSqlCommand(“select na1 from test”, conn); and also with test2 as the table
RESULT: {“ab”,”a-c”}

This clearly shows that with SQL collation and non Unicode data type, at windows application layer different sort logic is applied and at the SQL layer a different logic is applied.

So probably to keep the sorting logic constant across both layers, one can have a Windows collation like the Latin1_General_CI_AS toSQL_Latin1_General_CP1_CI_AS

Regarding conversion of a collation of a column fromSQL_Latin1_General_CP1_CI_AS  to Latin1_General_CI_AS , The column of interest if changed should be checked if it is used in any join conditions. If the joining column is not changed then we will get an error

Example: select * from tblWinCol inner join tblSQLCol on tblWinCol.a1 = tblSQLCol.a1 will give error “Cannot resolve collation conflict for equal to operation.” Same should also be checked with any temp table joins. Not sure if any data losses will occur. Ideally it should not occur

This was done on SQL 2005 and was a nice learning that i thought should be shared.

No comments:

Post a Comment

Note: Only a member of this blog may post a comment.


Related Posts with Thumbnails