Mladen Prajdić Blog

Blog about stuff and things and stuff. Mostly about SQL server and .Net

Comparing SQL Server HASHBYTES function and .Net hashing

A while back we had an interesting problem at work. We were calculating MD5 hashes for some values in both .Net and SQL Server and although the input values were all the same our MD5 hashes were different. After some time spent looking dumbfounded at the code I’ve realized what the bug was and started laughing. I saw that HashBytes function was the “culprit”. HashBytes function was introduced in SQL server 2005 to simplify creating hashes in the database. It can convert values to MD2, MD4, MD5, SHA, or SHA1 formats. HashBytes' result depends on the input text’s data type and can at the first glance give a bit different results that we might think, especially when combined with the .Net framework. Of course hashing itself works correctly in both. Let’s see what going on with some examples. From here on when i say hash i mean MD5 hash.

Example in SQL Server

-- note the size 15 of all datatypes
declare @val1 varchar(15),
        @val2 nvarchar(15),
        @val3 char(15),
        @val4 nchar(15)

– example of 1 byte/char text – all variables are of same length select @val1 = '1234567890', @val2 = N'1234567890', @val3 = '1234567890', @val4 = N'1234567890'

– all 4 return different results select HASHBYTES('md5', @val1) as MD5_varchar, – result = 0xE807F1FCF82D132F9BB018CA6738A19F – just to show that collation doesn't change the hash HASHBYTES('md5', @val1 collate Cyrillic_General_BIN2) as MD5_varchar_collation, – result = 0xE807F1FCF82D132F9BB018CA6738A19F HASHBYTES('md5', @val2) as MD5_Nvarchar, – result = 0xE15E31C3D8898C92AB172A4311BE9E84 HASHBYTES('md5', @val3) as MD5_char, – result = 0x2120C3F3423F89BA8A65ABD933321884 HASHBYTES('md5', @val4) as MD5_Nchar – result = 0x90DEF5840F3A31174CA44E2022F743B6

– example of 2 bytes/char text

-- 中文 means Chinese in written text
-- converting 2 bytes/char text to varchar cuts the text in half
select  @val1 =  '中文',   -- this cuts the text in half
        @val2 = N'中文',   -- this stores the whole text
        @val3 =  '中文',   -- this cuts the text in half
        @val4 = N'中文'    -- this stores the whole text

-- all 4 return different results
select  HASHBYTES('md5', @val1) as MD5_varchar,     -- result = 0xEA03FCB8C47822BCE772CF6C07D0EBBB
        HASHBYTES('md5', @val2) as MD5_Nvarchar,    -- result = 0x73C6C8CD2F94355EF015E5265D5E65B1
        HASHBYTES('md5', @val3) as MD5_char,        -- result = 0xA13C45A38853677887B4839071537634
        HASHBYTES('md5', @val4) as MD5_Nchar        -- result = 0xEADEBD3BD72A481C43C828E0C550145C
 
The catch here is the data type difference. Nvarchar and nchar take twice more bytes to store data than varchar and char. Since char and nchar pad the right side of the string with spaces to fill the gap up to defined data type length the spaces are also used in MD5 calculation. A collation plays no part in calculating hash values since collations are only applied to sorts and comparisons.

However care must be taken with texts that need 2 bytes per char of storage space like Chinese text. Storing it in a varchar variable cuts it in half.

Example in .Net - C#

For generating MD5 hashes I’ve used the code from this site which turned up as the first result on Google for “MD5 in C#”:

public string GetMD5Hash(string input)
{
    System.Security.Cryptography.MD5CryptoServiceProvider x = new System.Security.Cryptography.MD5CryptoServiceProvider();
    //byte[] bs = System.Text.Encoding.ASCII.GetBytes(input);
    //byte[] bs = System.Text.Encoding.UTF7.GetBytes(input);
    //byte[] bs = System.Text.Encoding.UTF8.GetBytes(input);
    byte[] bs = System.Text.Encoding.Unicode.GetBytes(input);
    //byte[] bs = System.Text.Encoding.UTF32.GetBytes(input);
    bs = x.ComputeHash(bs);
    System.Text.StringBuilder s = new System.Text.StringBuilder();
    foreach (byte b in bs)
    {
        s.Append(b.ToString("x2").ToLower());
    }
    string password = s.ToString();
    return password;
}

// hashes for the simple “1234567890” text in all encodings ASCII: e807f1fcf82d132f9bb018ca6738a19f UTF7: e807f1fcf82d132f9bb018ca6738a19f UTF8: e807f1fcf82d132f9bb018ca6738a19f Unicode: e15e31c3d8898c92ab172a4311be9e84 UTF32: 6a57502c29a5081f03cb70e0ad38ecc7

// hashes for the complex “中文” text in all encodings ASCII: ea03fcb8c47822bce772cf6c07d0ebbb UTF7: eb02105e5c51a33f21e8da7f8102cfda UTF8: a7bac2239fcdcb3a067903d8077c4a07 Unicode: 73c6c8cd2f94355ef015e5265d5e65b1 UTF32: 65fe91b81ed1107566f9f9f5ed4ccaf1

 

All strings in .Net store their chars in 2 bytes by default. When hashing values we have to take this into account and use proper text Encoding. .Net supports 5 encodings: ASCII (7 bits per char), UTF7 (7 bits per char), UTF8 (8 bits = 1 byte per char), Unicode (UTF-16) (16 bits = 2 bytes per char) and UTF32 (32 bits = 4 bytes per char).

For text with only first 127 chars in the ASCII table ASCII, UTF7 and UTF8 encodings all return the same hash, but with UTF16 and UTF32 comes endianness so they don’t return the same hash values. Also note that the ASCII and UTF7 encoding aren’t recommended to be used anymore except in legacy apps. UTF8 should be used instead.

For text with complex chars we have to use Unicode or UTF32 or we loose char information.

Putting .Net and SQL Server together

 

It turns out that such a simple thing as hashing can become a serious issue if we’re not careful. Trouble always awaits when dealing with text and encodings. :)

When using SQL Server’s varchar data type the .Net encoding to go with is UTF8 since it’s the fastest and most optimized of the three (ASCII, UTF7, UTF8). When using the nvarchar data type to go is Unicode (UTF16) but we also have to know the texts endianness to create correct hashes. UTF32 is practically useless in this case because SQL Server doesn’t have a data type that stores text in 4 bytes/char so we’ll never get the same results if we use it.

This advice only applies when creating hashes both in .Net and SQL server and comparing them. If we’re creating hashes in .Net and only store them in a database then we don’t have to worry about this.

 

kick it on DotNetKicks.com

Legacy Comments


Gecko Coder
2009-08-26
re: Comparing SQL Server HASHBYTES function and .Net hashing
I'm glad you ran in to this. I found the same thing:

This:
select substring(master.dbo.fn_varbintohexstr(hashbytes('MD5', CONVERT(varbinary, 'test@test.com'))), 3, 32)

is different from this:

select substring(master.dbo.fn_varbintohexstr(hashbytes('MD5', CONVERT(varbinary, email))), 3, 32)

...when email is an nvarchar field.

My Solution: create a varchar (not nvarchar) field and set it to email. And then, convert.

Codematrix
2009-11-04
re: Comparing SQL Server HASHBYTES function and .Net hashing
This doesn't with something like this:

Αυθεντικόκείμενο


Florian Reischl
2010-01-13
re: Comparing SQL Server HASHBYTES function and .Net hashing
This blog entry was referenced in a German developer forum:
http://www.mycsharp.de/wbb2/thread.php?threadid=80354

Greets
Flo

Eric North
2010-03-19
re: Comparing SQL Server HASHBYTES function and .Net hashing
Thanks for this. I've been pulling my hair out trying to figure out.

Here's the code for anyone who wants it:

string password = "test";
SHA1 hash = new SHA1CryptoServiceProvider();

hash.ComputeHash(Encoding.ASCII.GetBytes(password));
//at this point hash.hash will be the same as HashBytes('sha1', cast('test' as varchar(30)))

//OR

hash.ComputeHash(Encoding.Unicode.GetBytes(password));
//at this point hash.hash will be the same as HashBytes('sha1', cast('test' as Nvarchar(30)))