Peter Larsson Blog

Patron Saint of Lost Yaks

CHECKSUM weakness explained

The built-in CHECKUM function in SQL Server is built on a series of 4 bit left rotational xor operations. See here in a previous forum post http://www.sqlteam.com/forums/topic.asp?TOPIC_ID=70832 for more explanation.

Today, I wanted to see how often a collision (or false positive) can occur.

Let's take a very simple CHECKSUM value, for example 123. Decimal 123 is "01111011" in binary representation.
Since CHECKSUM function rotates the iterative checksum value 4 bits to the left (same thing as multiplying by 16), how many permutations of two characters returns the same CHECKSUM value of 123? The answer is 16 permutations.

Let's investigate by writing down the solution of this two character replacement.

     abcd 1011      Second character
0000 aBCD           First character
--------------
0000 0111 1011


Before we do anything more, let's refresh our mind on how XOR works. 0 and 0 returns 0, 0 and 1 returns 1, 1 and 0 return 1 and finally 1 and 1 return 0.

_ _|_0_|_1_|
_0_|_0_|_1_|
_1_|_1_|_0_|


Why did I then write the mysterious abcd and aBCD in the formula above?

Because the resulting bit sequence should be 0111, and only the same bits xor'ed gives 0. And the reason to use lower and upper characters for remaining bits are also due to xor. A result of 1 means the two bits has to be the opposite of each other; 0 xor 1 gives 1 and 1 xor 0 also gives 1 as result.

So "a" can have 2 possible values, and so can "b", "c", and "d" have. It gives 16 permutations in total.
See here

Bitseq1 Bitseq2 xor
0000    0111    0111
0001    0110    0111
0010    0101    0111
0011    0100    0111
0100    0011    0111
0101    0010    0111
0110    0001    0111
0111    0000    0111
1000    1111    0111
1001    1110    0111
1010    1101    0111
1011    1100    0111
1100    1011    0111
1101    1010    0111
1110    1001    0111
1111    1000    0111

This table above can be applied for any number of character for the first or last character.

Can this be applied onto a 3 character replacement? Of course, and in the same way. Still using the original decimal value of 123, the used characters are these


          abcd 1011      Third character
     xxxx aBCD           Second character
0000 xxxx                First character
-------------------
0000 0000 0111 1011



As you now can see, there are 16 times 16 possible (256) permutations giving the same CHECKSUM value!

And for four characters, the map looks like this


               abcd 1011      Fourth character
          yyyy aBCD           Third character
     xxxx yyyy                Second character
0000 xxxx                     First character
------------------------
0000 0000 0000 0111 1011



As you now can see, there are 16 times 16 times 16 permutations (4096) giving the same CHECKSUM value

This means I can with any two characters create same original checksum value 16 times. With any three characters (with a total of 65536 permuations), I can create 256 combinations resulting in the same decimal 123 checksum value.

Let's now see how the likelyness of getting a collision (or a false positive) is.

Character      256^(c-1)     16^(c-1)
replacment     Permutations  Collisions
2                       256          16
3                     65536         256
4                  16777216        4096
5                4294967296       65536
6             1099511627776     1048576
7           281474976710656    16777216

From the table above, we can deduct that the number of collisions are the square root of the number possible permutations. This can be interpreted as "if you double the number of records, and calculate a checksum value, the likelyness of getting a collision (false positive) is quadrupled."
Triple the number of records and you have 9 times more chance of getting a collision (false positive).

The reason I don't give more than 7 character example is that for the eight character, the leftmost 4 bits are rotated and xored to the lower part of the iterative CHECKSUM value. It doesn't matter really, because then the flushed bits must match the final "1011" in this example, so the other part still has 16 permutations.