CHECKSUM weakness explained
The built-in CHECKUM function in SQL Server is built on a series of 4 bit left rotational xor operations. See here in a previous forum post http://www.sqlteam.com/forums/topic.asp?TOPIC_ID=70832 for more explanation.
Today, I wanted to see how often a collision (or false positive) can occur.
Let's take a very simple CHECKSUM value, for example 123. Decimal 123 is "01111011" in binary representation.
Since CHECKSUM function rotates the iterative checksum value 4 bits to the left (same thing as multiplying by 16), how many permutations of two characters returns the same CHECKSUM value of 123? The answer is 16 permutations.
Let's investigate by writing down the solution of this two character replacement.
abcd 1011 Second character
0000 aBCD First character
--------------
0000 0111 1011
Before we do anything more, let's refresh our mind on how XOR works. 0 and 0 returns 0, 0 and 1 returns 1, 1 and 0 return 1 and finally 1 and 1 return 0.
_ _|_0_|_1_|
_0_|_0_|_1_|
_1_|_1_|_0_|
Why did I then write the mysterious abcd and aBCD in the formula above?
Because the resulting bit sequence should be 0111, and only the same bits xor'ed gives 0. And the reason to use lower and upper characters for remaining bits are also due to xor. A result of 1 means the two bits has to be the opposite of each other; 0 xor 1 gives 1 and 1 xor 0 also gives 1 as result.
So "a" can have 2 possible values, and so can "b", "c", and "d" have. It gives 16 permutations in total.
See here
Bitseq1 Bitseq2 xor
0000 0111 0111
0001 0110 0111
0010 0101 0111
0011 0100 0111
0100 0011 0111
0101 0010 0111
0110 0001 0111
0111 0000 0111
1000 1111 0111
1001 1110 0111
1010 1101 0111
1011 1100 0111
1100 1011 0111
1101 1010 0111
1110 1001 0111
1111 1000 0111
This table above can be applied for any number of character for the first or last character.
Can this be applied onto a 3 character replacement? Of course, and in the same way. Still using the original decimal value of 123, the used characters are these
abcd 1011 Third character
xxxx aBCD Second character
0000 xxxx First character
-------------------
0000 0000 0111 1011
As you now can see, there are 16 times 16 possible (256) permutations giving the same CHECKSUM value!
And for four characters, the map looks like this
abcd 1011 Fourth character
yyyy aBCD Third character
xxxx yyyy Second character
0000 xxxx First character
------------------------
0000 0000 0000 0111 1011
As you now can see, there are 16 times 16 times 16 permutations (4096) giving the same CHECKSUM value
This means I can with any two characters create same original checksum value 16 times. With any three characters (with a total of 65536 permuations), I can create 256 combinations resulting in the same decimal 123 checksum value.
Let's now see how the likelyness of getting a collision (or a false positive) is.
Character 256^(c-1) 16^(c-1)
replacment Permutations Collisions
2 256 16
3 65536 256
4 16777216 4096
5 4294967296 65536
6 1099511627776 1048576
7 281474976710656 16777216
From the table above, we can deduct that the number of collisions are the square root of the number possible permutations. This can be interpreted as "if you double the number of records, and calculate a checksum value, the likelyness of getting a collision (false positive) is quadrupled."
Triple the number of records and you have 9 times more chance of getting a collision (false positive).
The reason I don't give more than 7 character example is that for the eight character, the leftmost 4 bits are rotated and xored to the lower part of the iterative CHECKSUM value. It doesn't matter really, because then the flushed bits must match the final "1011" in this example, so the other part still has 16 permutations.